Data trees are trees in which each node, besides carrying a label from a finite alphabet, also carries
a data value from an infinite domain. They have been used as an abstraction model for reasoning
tasks on XML and verification. However, most existing approaches consider the case where only
equality tests can be performed on the data values.

In this paper we study data trees in which the data values come from a linearly ordered domain,
and in addition to equality tests, we can test whether the data value in a node is greater than the
one in another node. We introduce an automata model for them which we call ordered-data tree
automata (ODTA), provide its logical characterisation, and prove that its emptiness problem is
decidable in 3-NExpTime. We also show that the two-variable logic on unranked trees, studied by
Bojanczyk, Muscholl, Schwentick and Segoufin in 2009, corresponds precisely to a special subclass
of this automata model.

Then we define a slightly weaker version of ODTA, which we call weak ODTA, and provide 
its logical characterisation. The complexity of the emptiness problem drops to NP. However, a 
number of existing formalisms and models studied in the literature can be captured already by 
weak ODTA. We also show that the definition of ODTA can be easily modified to the case where
the data values come from a tree-like partially ordered domain, such as strings.

Categories and Subject Descriptors: F.1.1 [Models of Computation]: Automata; F.4.1
[Mathematical Logic]: Computational logic

General Terms: Languages 

Additional Key Words and Phrases: Finite-state automata, Two-variable logic, Data trees,
Ordered data values



1. INTRODUCTION 

Classical automata theory studies words and trees over finite alphabets. Recently 
there has been a growing interest in the so-called "data" words and trees, that 
is, words and trees in which each position, besides carrying a label from a finite 
alphabet, also carries a data value from an infinite domain. 

Interest in such structures with data stems from their connection to XML [Alon
et al. 2003; Arenas et al. 2008; Bjorklund et al. 2008; David et al. 2012; Fan and
Libkin 2002; Figueira 2009; Neven 2002], as well as to system specifications [Bouyer
et al. 2001; Demri et al. 2007; Segoufin and Torunczyk 2011], where many properties
simply cannot be captured by finite alphabets. This has motivated various
works on data words [Benedikt et al. 2010; Bojanczyk et al. 2011; Demri and Lazic
2009; Grumberg et al. 2010; Kaminski and Francez 1994; Neven et al. 2004], as well
as on data trees [Bjorklund and Bojanczyk 2007; Bojanczyk et al. 2009; Figueira
2010; Figueira and Segoufin 2011; Jurdzinski and Lazic 2007]. The common feature
of these works is the addition of equality tests on the data values to the logic on
trees. While for finitely-labeled trees many logical formalisms (e.g., the monadic
second-order logic MSO) are decidable by converting formulae to automata, even
FO (first-order logic) on data words extended with data-equality is already
undecidable. See, e.g., [Bojanczyk et al. 2011; Fan and Libkin 2002; Neven et al.
2004].

Thus, there is a need for expressive enough, while computationally well-behaved, 
frameworks to reason about structures with data values. This has been quite a 
common theme in XML and system specification research. It has largely followed 
two routes. The first takes a specific reasoning task, or a set of similar tasks, 
and builds algorithms for them (see, e.g., [Arenas et al. 2008; Bjorklund et al. 
2008; Schwentick 2004; Fan and Libkin 2002; Figueira 2009]). The second looks for 
sufficiently general automata models that can express reasoning tasks of interest, 
but are still decidable (see, e.g., [Demri and Lazic 2009; Bojanczyk et al. 2009; 
Jurdzinski and Lazic 2007; Segoufin and Torunczyk 2011]). 

Both approaches usually assume that data values come from an abstract set 
equipped only with the equality predicate. This is already sufficient to capture a 
wide range of interesting applications both in databases and verification. However, 
it has been advocated in [Deutsch et al. 2009] that comparisons based on a linear 
order over the data values could be useful in many scenarios, including data centric 
applications built on top of a database. 

So far, little work has been done in this direction. A few works such
as [Manuel 2010; Schwentick and Zeume 2010; Segoufin and Torunczyk 2011] are
on words, while in most applications we need to consider trees. Moreover, these
works are incomparable to some interesting existing formalisms [Fan and Libkin
2002; Bojanczyk et al. 2009; Arenas et al. 2008; David et al. 2012; Jurdzinski
and Lazic 2007; Demri and Lazic 2009; Lazic 2011] known to be able to capture
various interesting scenarios common in practice. On top of that, many useful
techniques, notably those introduced in [Fan and Libkin 2002; Bojanczyk et al. 2011;
Bojanczyk et al. 2009; Jurdzinski and Lazic 2007], can deal only with data equality,
and are highly dependent on specific combinatorial properties of the formalisms.
They are rather hard to adapt to other more specific tasks, let alone to generalise
to include more relations on data values, and they tend to produce extremely
high complexity bounds, such as non-primitive-recursive, or at least as hard as the
reachability problem in Petri nets. Furthermore, most known decidability results
are lost as soon as we add the order relation on data values. See, e.g., [Bojanczyk
et al. 2011].

In this paper we study the notion of data trees in which the data values come from
a linearly ordered domain, which we call ordered-data trees. In addition to equality
tests on the data values, in ordered-data trees we are allowed to test whether the
data value in a node is greater than the data value in another node. To the extent it
is possible, we aim to unify various ad hoc methods introduced to reason about data
trees, and generalise them to ordered-data trees to make them more accessible and
applicable in practice. This paper is the first step, where we introduce an automata
model for ordered-data trees, provide its logical characterisation, and prove that
it has a decidable emptiness problem. Moreover, we also show that it can capture
various well-known formalisms.

Brief description of the results in this paper. The trees that we consider are
unranked trees where there is no a priori bound on the number of children of a
node. Moreover, we also have an order on the children of each node. We consider
a natural logic for ordered-data trees, which consists of the following relations.

— The parent relation $E_\downarrow$, where $E_\downarrow(x,y)$ means that node $x$ is the parent of node $y$.

— The next-sibling relation $E_\rightarrow$, where $E_\rightarrow(x,y)$ means that nodes $x$ and $y$ have
the same parent and $y$ is the next sibling of $x$.

— The labeling predicates $a(\cdot)$'s, where $a(x)$ means that node $x$ is labeled with
symbol $a$.

— The data equality predicate $\sim$, where $x \sim y$ means that nodes $x$ and $y$ have the
same data value.

— The order relation on data $\prec$, where $x \prec y$ means that the data value in node $x$
is less than the one in node $y$.

— The successive order relation on data $\prec_{suc}$, where $x \prec_{suc} y$ means that the data
value in node $y$ is the minimal data value in the tree greater than the one in node $x$.

We introduce an automata model for ordered-data trees, which we call ordered- 
data tree automata (ODTA), and provide its logical characterisation. Namely, we 
prove that the class of languages accepted by ODTA corresponds precisely to those 
expressible by formulas of the form: 

\[ \exists X_1 \cdots \exists X_n \; \varphi \wedge \psi, \tag{1} \]

where

— $X_1, \ldots, X_n$ are monadic second-order predicates;

— $\varphi$ is an FO formula restricted to two variables and using only the predicates $E_\downarrow$,
$E_\rightarrow$, $\sim$, as well as the unary predicates $X_1, \ldots, X_n$ and $a$'s;

— $\psi$ is an FO formula using only the predicates $\sim$, $\prec$, $\prec_{suc}$, as well as the unary
predicates $X_1, \ldots, X_n$ and $a$'s.

We show that the logic $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$, first studied in [Bojanczyk et al. 2009],
corresponds precisely to a special subclass of ODTA, where $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$
denotes the set of formulas of the form (1) in which $\psi$ is a true formula. We then
prove that the emptiness problem of ODTA is decidable in 3-NExpTime. Our 
main idea here is to show how to convert the ordered-data trees back to a string 
over finite alphabets. (See our notion of string representation of data values in 
Section 3.) Such conversion enables us to use the classical finite state automata to 
reason about data values. 

Then we define a slightly weaker version of ODTA, which we call weak ODTA.
Essentially the only feature of ODTA missing in weak ODTA is the ability to test
whether two adjacent nodes have the same data value. Without this simple feature,
the complexity of the emptiness problem surprisingly drops three-fold exponentially
to NP. We provide its logical characterisation by showing that it corresponds
precisely to the languages expressible by the formulas of the form (1) where $\varphi$ does
not use the predicate $\sim$. We show that a number of existing formalisms and models
can be captured already by weak ODTA, namely those in [Fan and Libkin 2002; David
et al. 2012; Manuel 2010].

We should remark that [David et al. 2012] studies a formalism which consists of
tree automata and a collection of set and linear constraints.* It is shown that the
satisfaction problem of such a formalism is NP-complete. In fact, it is also shown
in [David et al. 2012] that a single set constraint (without tree automaton and linear
constraint) already yields NP-hardness. Weak ODTA are essentially equivalent to
the formalism in [David et al. 2012] extended with the full expressive power of the
first-order logic $\mathrm{FO}(\sim, \prec, \prec_{suc})$. It is worth noting that despite such an extension,
the emptiness problem remains in NP.

Finally we also show that the definition of ODTA can be easily modified to 
the case where the data values come from a partially ordered domain, such as 
strings. This work can be seen as a generalisation of the works in [David et al. 
2010] and [Kara et al. 2012]. However, it must be noted that [David et al. 2010; 
Kara et al. 2012] deal only with data words, where only equality test is allowed on 
the data values and there is no order on them. 

Related works. Most of the existing works in this area are on data words. In
[Bojanczyk et al. 2011] the model of data automata was introduced, and it
was shown that it captures the logic $\exists\mathrm{MSO}^2(\sim, <, +1)$, the fragment of existential
monadic second-order logic in which the first-order part uses two variables only
and the predicates: the data equality $\sim$, as well as the order $<$ and the successor
$+1$ on the domain.

An important feature of data automata is that their emptiness problem is
decidable, even for infinite words, but is at least as hard as reachability for Petri
nets. It was also shown that the satisfiability problem for the three-variable
first-order logic is undecidable. Later in [David et al. 2010] an alternative proof was
given for the decidability of the weaker logic $\exists\mathrm{MSO}^2(+1, \sim)$. The proof gives a
decision procedure with an elementary upper bound for the satisfaction problem
of $\exists\mathrm{MSO}^2(+1, \sim)$ on strings. Recently in [Kara et al. 2012] an automata model
that captures precisely the logic $\exists\mathrm{MSO}^2(+1, \sim)$, both on finite and infinite words,
was proposed.

Another logical approach is via the so-called linear temporal logic with freeze
quantifier, introduced in [Demri and Lazic 2009]. Intuitively, these are LTL formulas
equipped with a finite number of registers to store the data values. We denote
by $\mathrm{LTL}^{\downarrow}_n(X,U)$ the LTL with freeze quantifier, where $n$ denotes the number of
registers and the only temporal operators allowed are the neXt operator X and the
Until operator U. It was shown that alternating register automata with $n$ registers
($\mathrm{RA}_n$) accept all $\mathrm{LTL}^{\downarrow}_n(X,U)$ languages and the emptiness problem for alternating
$\mathrm{RA}_1$ is decidable. However, the complexity is non-primitive-recursive. Hence, the
satisfiability problem for $\mathrm{LTL}^{\downarrow}_1(X,U)$ is decidable as well. Adding one more register
or past-time operators, such as $X^{-1}$ or $U^{-1}$, to $\mathrm{LTL}^{\downarrow}_1(X,U)$ makes the satisfiability
problem undecidable. In [Lazic 2011] a weaker version of alternating $\mathrm{RA}_1$, called
safety alternating $\mathrm{RA}_1$, is considered, and the emptiness problem is shown to be
EXPSPACE-complete.

*We will later define formally what set and linear constraints are.

A model for data words with linearly ordered data values was proposed in [Segoufin
and Torunczyk 2011]. The model consists of an automaton equipped with a finite
number of registers, and its transitions are based on constraints on the data values
stored in the registers. It is shown that the emptiness problem for this model is
decidable in PSPACE. However, no logical characterisation is provided for this
model.

In [Bojanczyk et al. 2011] another type of register automata for words was
introduced and studied, which is a generalisation of the original register automata
introduced by Kaminski and Francez [Kaminski and Francez 1994], where the data
values also can come from a linearly ordered domain. Thus, the order comparison,
not just equality, can be performed on data values. This model is based on the
notion of monoid for data words, and is incomparable with our model here.

It is shown in [Manuel 2010] that the satisfaction problem for $\mathrm{FO}^2(+1, \prec_{suc})$
over text is decidable. A text is simply a data word in which all the data values
are different and they range over the positive integers from 1 to $n$, for some $n \geq 1$.
We will see later that the satisfaction problem for $\mathrm{FO}^2(+1, \prec_{suc})$ can be reduced
to the emptiness problem of our model.

In [Schwentick and Zeume 2010] it is shown that the satisfaction problem of the
logic $\mathrm{FO}^2(<, \prec)$ on words is decidable. This logic is incomparable with our model.
However, it should be noted that $\mathrm{FO}^2(<)$ cannot capture the whole class of regular
languages.

The work on data trees that we are aware of is in [Bojanczyk et al. 2009;
Jurdzinski and Lazic 2007]. In [Bojanczyk et al. 2009] it was shown that the
satisfaction problem for the logic $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$ over unranked trees is decidable
in 3-NExpTime. However, no automata model is provided. We will see later how
this logic corresponds precisely to a special subclass of ODTA.

In [Jurdzinski and Lazic 2007] alternating tree register automata were introduced
for trees. They are essentially the generalisation of the alternating $\mathrm{RA}_1$ to the tree
case. It was shown that this model captures the forward XPath queries. However,
no logical characterisation is provided and the emptiness problem, though decidable,
is non-primitive-recursive.

Organisation. This paper is organised as follows. In Section 2 we give some
preliminary background. In Section 3 we formally define the logic for ordered-data
trees and present a few examples as well as notations that we need in this paper.
In Section 4 we present two lemmas that we are going to need later on. We prove
them in a quite general setting, as we think they are interesting in their own right. We
introduce the ordered-data tree automata (ODTA) in Section 5 and weak ODTA
in Section 6. In Section 7 we discuss a couple of the undecidable extensions of weak
ODTA. In Section 8 we describe how to modify the definition of ODTA when the
data values are strings, that is, when they come from a partially ordered domain.
Finally we close with some concluding remarks in Section 9.

ACM Transactions on Computational Logic, Vol. V, No. N, December 2012. 



6 • Tony Tan 



2. PRELIMINARIES 

In this section we review some definitions that we are going to use later on. We
usually use $\Gamma$ and $\Sigma$ to denote finite alphabets. We write $2^{\Gamma}$ to denote an alphabet
in which each symbol corresponds to a subset of $\Gamma$. In some cases, we may need the
alphabet $2^{2^{\Gamma}}$, an alphabet in which each symbol corresponds to a set of subsets of
$\Gamma$. We denote the set of natural numbers $\{0, 1, 2, \ldots\}$ by $\mathbb{N}$.

Usually we write $\mathcal{L}$ to denote a language, for both string and tree languages.
When it is clear from the context, we use the term language to mean either a string
language, or a tree language.

2.1 Finite state automata over strings and commutative regular languages 

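We usually write $\mathcal{M}$ to denote a finite state automaton on strings. The language
accepted by the automaton $\mathcal{M}$ is denoted by $\mathcal{L}(\mathcal{M})$.

Let $\Sigma = \{a_1, \ldots, a_\ell\}$. For a word $w \in \Sigma^*$, the Parikh image of $w$ is $\mathrm{Parikh}(w) =
(n_1, \ldots, n_\ell)$, where $n_i$ is the number of appearances of $a_i$ in $w$. For a vector $\bar{n} \in \mathbb{N}^\ell$, the
inverse of the Parikh image of $\bar{n}$ is $\mathrm{Parikh}^{-1}(\bar{n}) = \{w \mid w \in \Sigma^* \text{ and } \mathrm{Parikh}(w) = \bar{n}\}$.

For $1 \leq i \leq \ell$, a vector $\bar{v} = (n_1, \ldots, n_\ell) \in \mathbb{N}^\ell$ is called an $i$-base, if $n_i \neq 0$
and $n_j = 0$, for all $j \neq i$. A language $\mathcal{L}$ is periodic, if there exist $(\ell + 1)$ vectors
$\bar{u}, \bar{v}_1, \ldots, \bar{v}_\ell$ such that $\bar{u} \in \mathbb{N}^\ell$ and each $\bar{v}_i$ is an $i$-base and

\[ \mathcal{L} = \bigcup_{h_1, \ldots, h_\ell \geq 0} \mathrm{Parikh}^{-1}(\bar{u} + h_1 \bar{v}_1 + \cdots + h_\ell \bar{v}_\ell). \]

We denote such a language $\mathcal{L}$ by $\mathcal{L}(\bar{u}, \bar{v}_1, \ldots, \bar{v}_\ell)$.

A language $\mathcal{L}$ is commutative if it is closed under reordering. That is, if $w =
b_1 \cdots b_m \in \mathcal{L}$, and $\sigma$ is a permutation on $\{1, \ldots, m\}$, then $b_{\sigma(1)} \cdots b_{\sigma(m)} \in \mathcal{L}$.

Theorem 2.1. [Ehrenfeucht and Rozenberg 1981, Corollary 2.2] A language is
commutative and regular if and only if it is a finite union of periodic languages.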
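
The definitions above can be illustrated with a short Python sketch. It is only an illustration under our own conventions: the function names `parikh` and `in_periodic` are hypothetical, words are Python strings, and each $i$-base is represented by its single non-zero entry, so membership in a periodic language reduces to one divisibility test per symbol.

```python
from collections import Counter

def parikh(word, alphabet):
    """Parikh image of `word`, as a tuple ordered like `alphabet`."""
    counts = Counter(word)
    return tuple(counts[a] for a in alphabet)

def in_periodic(word, alphabet, u, periods):
    """Membership in the periodic language L(u, v_1, ..., v_l).

    Assumption of this sketch: periods[i] is the only non-zero entry of
    the (i+1)-base v_{i+1}, so it is a positive integer.
    """
    if any(symbol not in alphabet for symbol in word):
        return False
    image = parikh(word, alphabet)
    # Parikh(word) = u + h_1 v_1 + ... + h_l v_l for some h_i >= 0 holds
    # exactly when every coordinate exceeds u_i by a multiple of the period.
    return all(
        n_i >= u_i and (n_i - u_i) % p_i == 0
        for n_i, u_i, p_i in zip(image, u, periods)
    )

# Words over {a, b} with an odd number of a's and any number of b's.
alphabet, u, periods = "ab", (1, 0), (2, 1)
print(in_periodic("ab", alphabet, u, periods))   # odd number of a's
print(in_periodic("aba", alphabet, u, periods))  # even number of a's
```

Since the test depends only on the Parikh image, any reordering of an accepted word is accepted as well, which is the commutativity property.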

2.2 Unranked trees, tree automata and transducers 

An unranked finite tree domain is a prefix-closed finite subset $D$ of $\mathbb{N}^*$ (words over
$\mathbb{N}$) such that $u \cdot i \in D$ implies $u \cdot j \in D$ for all $j < i$ and $u \in \mathbb{N}^*$. Given a finite
labeling alphabet $\Sigma$, a $\Sigma$-labeled unranked tree $t$ is a structure

\[ t = \langle D, \{a(\cdot)\}_{a \in \Sigma}, E_\downarrow, E_\rightarrow \rangle, \]

where

— $D$ is an unranked tree domain,

— $E_\downarrow$ is the child relation: $(u, u \cdot i) \in E_\downarrow$ for all $u, u \cdot i \in D$,

— $E_\rightarrow$ is the next-sibling relation: $(u \cdot i, u \cdot (i+1)) \in E_\rightarrow$ for all $u \cdot i, u \cdot (i+1) \in D$,
and

— the $a(\cdot)$'s are labeling predicates, i.e. for each node $u$, exactly one of $a(u)$, with
$a \in \Sigma$, is true.

We write $\mathrm{Dom}(t)$ to denote the domain $D$. The label of a node $u$ in $t$ is denoted by
$\mathsf{lab}_t(u)$. If $\mathsf{lab}_t(u) = a$, then we say that $u$ is an $a$-node.
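
As a small illustration of these definitions, the following Python sketch represents a node as a tuple of natural numbers (the root is the empty tuple) and checks the two closure conditions of a tree domain. The function names are hypothetical, and requiring the domain to be non-empty is an assumption of the sketch.

```python
def is_tree_domain(nodes):
    """Check that `nodes` (a set of int tuples) is an unranked tree domain."""
    nodes = set(nodes)
    if not nodes:
        return False  # assumption: a tree has at least its root
    for node in nodes:
        if node == ():
            continue
        parent, i = node[:-1], node[-1]
        # Prefix-closedness: checking the immediate parent of every node
        # is enough, by induction on the length of the node.
        if parent not in nodes:
            return False
        # u.i in D implies u.j in D for all j < i: checking the immediate
        # left sibling of every node is enough, by induction on i.
        if i > 0 and parent + (i - 1,) not in nodes:
            return False
    return True

def child_relation(nodes):
    """All pairs (u, u.i) with both nodes in the domain."""
    return {(u[:-1], u) for u in nodes if u}

def next_sibling_relation(nodes):
    """All pairs (u.i, u.(i+1)) with both nodes in the domain."""
    return {
        (u, u[:-1] + (u[-1] + 1,))
        for u in nodes
        if u and u[:-1] + (u[-1] + 1,) in nodes
    }

domain = {(), (0,), (1,), (1, 0)}
print(is_tree_domain(domain))
print(sorted(next_sibling_relation(domain)))
```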

An unranked tree automaton [Comon et al. 2007; Thatcher 1967] over $\Sigma$-labeled
trees is a tuple $\mathcal{A} = \langle Q, \Sigma, \delta, F \rangle$, where $Q$ is a finite set of states, $F \subseteq Q$ is the set
of final states, and $\delta : Q \times \Sigma \to 2^{Q^*}$ is a transition function; we require the $\delta(q, a)$'s
to be regular languages over $Q$ for all $q \in Q$ and $a \in \Sigma$.

A run of $\mathcal{A}$ over a tree $t$ is a function $\rho_{\mathcal{A}} : \mathrm{Dom}(t) \to Q$ such that for each node
$u$ with $n$ children $u \cdot 0, \ldots, u \cdot (n-1)$, the word $\rho_{\mathcal{A}}(u \cdot 0) \cdots \rho_{\mathcal{A}}(u \cdot (n-1))$ is in
the language $\delta(\rho_{\mathcal{A}}(u), \mathsf{lab}_t(u))$. For a leaf $u$ labeled $a$, this means that $u$ could be
assigned a state $q$ if and only if the empty word $\epsilon$ is in $\delta(q, a)$. A run is accepting
if $\rho_{\mathcal{A}}(\epsilon) \in F$, i.e., if the root is assigned a final state. A tree $t$ is accepted by $\mathcal{A}$
if there exists an accepting run of $\mathcal{A}$ on $t$. The set of all trees accepted by $\mathcal{A}$ is
denoted by $\mathcal{L}(\mathcal{A})$.
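
The notion of a run can be made concrete with a minimal Python sketch. It rests on several assumptions that are ours and not part of the definition: states are single characters, each regular language $\delta(q, a)$ is given as a Python regular expression over state names, and a tree is a pair `(label, children)`. The sketch computes bottom-up the set of states that some run can assign to a node, by brute-force enumeration of the children's state words, so it is meant for small examples only.

```python
import re
from itertools import product

def possible_states(tree, delta, states):
    """States q for which some run on `tree` assigns q to its root."""
    label, children = tree
    child_sets = [possible_states(c, delta, states) for c in children]
    result = set()
    for q in states:
        pattern = delta.get((q, label))
        if pattern is None:
            continue  # delta(q, label) is the empty language
        # Try every word of states the children can be assigned; for a
        # leaf the only candidate is the empty word.
        for word in product(*child_sets):
            if re.fullmatch(pattern, "".join(word)):
                result.add(q)
                break
    return result

def accepts(tree, delta, states, final):
    return bool(possible_states(tree, delta, states) & final)

# Hypothetical automaton: state "1" (resp. "0") means that the subtree
# contains an odd (resp. even) number of a-nodes.
EVEN = r"(0*10*1)*0*"   # words over {0, 1} with an even number of 1s
ODD = r"0*1(0*10*1)*0*"  # words over {0, 1} with an odd number of 1s
delta = {("1", "a"): EVEN, ("0", "a"): ODD,
         ("1", "b"): ODD, ("0", "b"): EVEN}

tree = ("a", [("b", []), ("a", [])])
print(accepts(tree, delta, {"0", "1"}, final={"0"}))
```

In this example the final state is "0", so the automaton accepts exactly the trees with an even number of $a$-nodes.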

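An unranked tree (letter-to-letter) transducer with the input alphabet $\Sigma$ and
output alphabet $\Gamma$ is a tuple $\mathcal{T} = \langle \mathcal{A}, \mu \rangle$, where $\mathcal{A}$ is a tree automaton with the set
of states $Q$, and $\mu \subseteq Q \times \Sigma \times \Gamma$ is an output relation. We call such $\mathcal{T}$ a transducer
from $\Sigma$ to $\Gamma$.

Let $t$ be a $\Sigma$-labeled tree, and $t'$ a $\Gamma$-labeled tree such that $\mathrm{Dom}(t) = \mathrm{Dom}(t')$.

We say that a tree $t'$ is an output of $\mathcal{T}$ on $t$, if there is an accepting run $\rho_{\mathcal{A}}$ of $\mathcal{A}$ on
$t$ and for each $u \in \mathrm{Dom}(t)$, it holds that $(\rho_{\mathcal{A}}(u), \mathsf{lab}_t(u), \mathsf{lab}_{t'}(u)) \in \mu$. We call $\mathcal{T}$
an identity transducer, if $\mathsf{lab}_t(u) = \mathsf{lab}_{t'}(u)$ for all $u \in \mathrm{Dom}(t)$. We will often view
an automaton $\mathcal{A}$ as an identity transducer.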

2.3 Automata with Presburger constraints (APC) 

An automaton with Presburger constraints (APC) is a tuple $\langle \mathcal{A}, \xi \rangle$, where $\mathcal{A}$ is an
unranked tree automaton with states $q_0, \ldots, q_m$ and $\xi$ is an existential Presburger
formula with free variables $x_0, \ldots, x_m$. A tree $t$ is accepted by $\langle \mathcal{A}, \xi \rangle$, denoted by
$t \in \mathcal{L}(\mathcal{A}, \xi)$, if there is an accepting run $\rho_{\mathcal{A}}$ of $\mathcal{A}$ on $t$ such that $\xi(n_0, \ldots, n_m)$ is
true, where $n_i$ is the number of appearances of $q_i$ in $\rho_{\mathcal{A}}$.

Theorem 2.2. [Seidl et al. 2004; Verma et al. 2005] The emptiness problem for 
APC is decidable in NP. 

It is worth noting also that the class of languages accepted by APC is closed 
under union and intersection. 

Oftentimes, instead of counting the number of states in the accepting run, we
need to count the number of occurrences of alphabet symbols in the tree. Since
we can easily embed the alphabet symbols inside the states, we always assume
that the Presburger formula $\xi$ has the free variables $x_a$'s to denote the number of
appearances of the symbol $a$ in the tree.

As in the word case, we let $\mathrm{Parikh}(t)$ denote the Parikh image of the tree $t$. We
will need the following proposition.

Proposition 2.3. [Seidl et al. 2004; Verma et al. 2005] Given an unranked
tree automaton $\mathcal{A}$, one can construct, in polynomial time, an existential Presburger
formula $\xi_{\mathcal{A}}(x_1, \ldots, x_\ell)$ such that

— for every tree $t \in \mathcal{L}(\mathcal{A})$, $\xi_{\mathcal{A}}(\mathrm{Parikh}(t))$ holds;

— for every $\bar{n} = (n_1, \ldots, n_\ell)$ such that $\xi_{\mathcal{A}}(\bar{n})$ holds, there exists a tree $t \in \mathcal{L}(\mathcal{A})$
with $\mathrm{Parikh}(t) = \bar{n}$.
3. ORDERED-DATA TREES AND THEIR LOGIC

An ordered-data tree over the alphabet $\Sigma$ is a tree in which each node, besides
carrying a label from the finite alphabet $\Sigma$, also carries a data value from $\mathbb{N} =
\{0, 1, \ldots\}$.†

Let $t$ be an ordered-data tree over $\Sigma$ and $u \in \mathrm{Dom}(t)$. We write $\mathsf{val}_t(u)$ to denote
the data value in the node $u$. The set of all data values in the $a$-nodes in $t$ is denoted
by $V_t(a)$. That is, $V_t(a) = \{\mathsf{val}_t(u) \mid \mathsf{lab}_t(u) = a \text{ and } u \in \mathrm{Dom}(t)\}$. We write $V_t$
to denote the set of data values found in the tree $t$. We also write $\#_t(a)$ to denote
the number of $a$-nodes in $t$.

The profile of a node $u$ is a triplet $(l, p, r) \in \{\top, \bot, *\} \times \{\top, \bot, *\} \times \{\top, \bot, *\}$,
where $l = \top$ and $l = \bot$ indicate that the node $u$ has the same data value and
different data value as its left sibling, respectively; $l = *$ indicates that $u$ does not
have a left sibling. Similarly, $p = \top$, $p = \bot$, and $p = *$ have the same meaning in
relation to the parent of the node $u$, while $r = \top$, $r = \bot$, and $r = *$ mean the
same in relation to the right sibling of the node $u$. For an ordered-data tree $t$ over
$\Sigma$, the profile tree of $t$, denoted by $\mathrm{Profile}(t)$, is a tree over $\Sigma \times \{\top, \bot, *\}^3$ obtained
by augmenting each node of $t$ with its profile.
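
The profile of a node can be computed directly from the definition. The sketch below assumes, for illustration only, that an ordered-data tree is stored as a dictionary mapping each node (a tuple of natural numbers, the root being the empty tuple) to a pair `(label, data value)`; the function name `profile` and the sample tree are hypothetical.

```python
TOP, BOT, NONE = "⊤", "⊥", "*"

def profile(tree, u):
    """Profile (l, p, r) of node `u` in `tree` (dict: node -> (label, value))."""
    def compare(v):
        # "*" if the neighbour v does not exist, otherwise compare data values.
        if v not in tree:
            return NONE
        return TOP if tree[v][1] == tree[u][1] else BOT

    if u == ():
        return (NONE, NONE, NONE)  # the root has no siblings and no parent
    parent, i = u[:-1], u[-1]
    left = compare(parent + (i - 1,)) if i > 0 else NONE
    right = compare(parent + (i + 1,))
    return (left, compare(parent), right)

def profile_tree(tree):
    """Augment each node's label with the node's profile."""
    return {u: (label, profile(tree, u)) for u, (label, _) in tree.items()}

# A small hypothetical tree: root a with value 1, children b (1) and a (2).
t = {(): ("a", 1), (0,): ("b", 1), (1,): ("a", 2)}
print(profile(t, (0,)))
print(profile(t, (1,)))
```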

We write $\mathrm{Proj}(t)$ to denote the $\Sigma$ projection of the ordered-data tree $t$, that is,
$\mathrm{Proj}(t)$ is $t$ without the data values. When we say that an ordered-data tree $t$ is
accepted by an automaton $\mathcal{A}$, we mean that $\mathrm{Proj}(t)$ is accepted by $\mathcal{A}$. An
ordered-data tree $t'$ is an output of a transducer $\mathcal{T}$ on an ordered-data tree $t$, if $\mathrm{Proj}(t')$ is
an output of $\mathcal{T}$ on $\mathrm{Proj}(t)$, and for all $u \in \mathrm{Dom}(t')$, we have $\mathsf{val}_{t'}(u) = \mathsf{val}_t(u)$.

Figure 1 shows an example of an ordered-data tree $t$ over the alphabet $\{a, b, c\}$
with its profile tree. The notation $\binom{a}{d}$ means that the node is labeled with $a$ and
has data value $d$.

3.1 String representations of data values 

Let $t$ be an ordered-data tree over $\Gamma$. For a set $S \subseteq \Gamma$, let

\[ [S]_t = \bigcap_{a \in S} V_t(a) \;\cap\; \bigcap_{b \notin S} \overline{V_t(b)}. \]

Note that for each $a \in \Gamma$,

\[ V_t(a) = \bigcup_{S \text{ s.t. } a \in S} [S]_t. \]

Since the sets $[S]_t$'s are disjoint, it is immediate that $|V_t(a)| = \sum_{S \text{ s.t. } a \in S} |[S]_t|$.

Let $d_1 < \cdots < d_m$ be all the data values found in $t$. The string representation
of the data values in $t$, denoted by $\mathcal{V}_\Gamma(t)$, is the string $S_1 \cdots S_m$ over the alphabet
$2^{\Gamma} - \{\emptyset\}$ of length $m$ such that $d_i \in [S_i]_t$, for each $i = 1, \ldots, m$. The notation $[S]_t$
was already introduced in [David et al. 2010; 2012], but not $\mathcal{V}_\Gamma(t)$.

Fig. 1. An example of an ordered-data tree (on the left) and its profile (on the
right).

†Here we use the natural numbers as data values just to be concrete. The results in our paper
apply trivially to any linearly ordered domain.

Consider the example of the tree $t$ in Figure 1. The data values in $t$ are 1, 2, 4, 6, 7,

where

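\[ [\{b,c\}]_t = \{1\}, \]
\[ [\{a,b,c\}]_t = \{2\}, \]
\[ [\{a,b\}]_t = \{4,7\}, \]
\[ [\{a,c\}]_t = \{6\}, \]
\[ [S]_t = \emptyset, \text{ for all the other } S\text{'s}. \]

The string $\mathcal{V}_\Gamma(t)$ is $S_1 S_2 S_3 S_4 S_5$, where $S_1 = \{b,c\}$, $S_2 = \{a,b,c\}$, $S_3 = S_5 =
\{a,b\}$ and $S_4 = \{a,c\}$.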
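
The string representation depends only on which labels occur with which data values, so it can be computed from the list of (label, data value) pairs of the nodes. The Python sketch below does this; the function names are hypothetical, and the sample list `nodes` is our own choice, picked to be consistent with the classes $[S]_t$ listed above rather than copied from Figure 1.

```python
def label_sets(nodes):
    """Map each data value d to the set S of labels with d in [S]_t."""
    labels_of = {}
    for label, value in nodes:
        labels_of.setdefault(value, set()).add(label)
    return {value: frozenset(labels) for value, labels in labels_of.items()}

def value_classes(nodes):
    """The non-empty classes [S]_t, as a dict from S to a set of data values."""
    classes = {}
    for value, labels in label_sets(nodes).items():
        classes.setdefault(labels, set()).add(value)
    return classes

def string_representation(nodes):
    """The string S_1 ... S_m, listed in increasing order of data values."""
    sets = label_sets(nodes)
    return [sets[d] for d in sorted(sets)]

# Hypothetical (label, value) pairs consistent with the classes above.
nodes = [("b", 1), ("c", 1), ("a", 2), ("b", 2), ("c", 2), ("a", 4),
         ("b", 4), ("a", 6), ("c", 6), ("a", 7), ("b", 7)]
for S in string_representation(nodes):
    print(sorted(S))
```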

3.2 A logic for ordered-data trees 

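An ordered-data tree $t$ over the alphabet $\Sigma$ can be viewed as a structure

\[ t = \langle D, \{a(\cdot)\}_{a \in \Sigma}, E_\downarrow, E_\rightarrow, \sim, \prec, \prec_{suc} \rangle, \]

where

— the relations $\{a(\cdot)\}_{a \in \Sigma}$, $E_\downarrow$, $E_\rightarrow$ are as defined before in Subsection 2.2,
— $u \sim v$ holds, if $\mathsf{val}_t(u) = \mathsf{val}_t(v)$,
— $u \prec v$ holds, if $\mathsf{val}_t(u) < \mathsf{val}_t(v)$,

— $u \prec_{suc} v$ holds, if $\mathsf{val}_t(v)$ is the minimal data value in $t$ greater than $\mathsf{val}_t(u)$.

Obviously, $x \prec_{suc} y$ can be expressed equivalently as $x \prec y \wedge \forall z(\neg(x \prec z \wedge z \prec
y))$. We include $\prec_{suc}$ for the sake of convenience. We also assume that we have
the predicates root($x$), first-sibling($x$), last-sibling($x$), and leaf($x$) which stand for
$\forall y(\neg E_\downarrow(y, x))$, $\forall y(\neg E_\rightarrow(y, x))$, $\forall y(\neg E_\rightarrow(x, y))$, and $\forall y(\neg E_\downarrow(x, y))$, respectively. We
also write $x \nsim y$ to denote $\neg(x \sim y)$.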
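
The equivalence between the two descriptions of $\prec_{suc}$ can be checked mechanically on small examples. In the Python sketch below, `val` is a hypothetical dictionary from node names to data values, and the two functions (also hypothetical names) implement the first-order definition and the "minimal greater value" definition, respectively.

```python
def succ_by_formula(val, u, v):
    """x <_suc y as: x < y and there is no z with x < z < y."""
    return val[u] < val[v] and not any(
        val[u] < val[z] < val[v] for z in val
    )

def succ_by_minimum(val, u, v):
    """val(v) is the minimal data value in the tree greater than val(u)."""
    greater = [d for d in val.values() if d > val[u]]
    return bool(greater) and val[v] == min(greater)

# Hypothetical data values on six nodes; two nodes share the value 4.
val = {"n0": 4, "n1": 1, "n2": 7, "n3": 2, "n4": 6, "n5": 4}
agree = all(
    succ_by_formula(val, u, v) == succ_by_minimum(val, u, v)
    for u in val for v in val
)
print(agree)
```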

For $\mathcal{O} \subseteq \{E_\downarrow, E_\rightarrow, \sim, \prec, \prec_{suc}\}$, we let $\mathrm{FO}(\mathcal{O})$ stand for the first-order logic with
the vocabulary $\mathcal{O}$, $\mathrm{MSO}(\mathcal{O})$ for its monadic second-order logic (which extends
$\mathrm{FO}(\mathcal{O})$ with quantification over sets of nodes), and $\exists\mathrm{MSO}(\mathcal{O})$ for its existential
monadic second-order logic, i.e., formulas of the form $\exists X_1 \ldots \exists X_m \, \psi$, where $\psi$
is an $\mathrm{FO}(\mathcal{O})$ formula over the vocabulary $\mathcal{O}$ extended with the unary predicates
$X_1, \ldots, X_m$.

We let $\mathrm{FO}^2(\mathcal{O})$ stand for $\mathrm{FO}(\mathcal{O})$ with two variables, i.e., the set of $\mathrm{FO}(\mathcal{O})$
formulae that only use two variables $x$ and $y$. The set of all formulae of the form
$\exists X_1 \ldots \exists X_m \, \psi$, where $\psi$ is an $\mathrm{FO}^2(\mathcal{O})$ formula, is denoted by $\exists\mathrm{MSO}^2(\mathcal{O})$. Note that
$\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow)$ is equivalent in expressive power to $\mathrm{MSO}(E_\downarrow, E_\rightarrow)$ over the usual
(without data) trees. That is, it defines precisely the regular tree languages [Thomas
1997].

As usual, we define $\mathcal{L}_{data}(\varphi)$ as the set of ordered-data trees that satisfy the
formula $\varphi$. In such case, we say that the formula $\varphi$ expresses the language $\mathcal{L}_{data}(\varphi)$.‡

The following theorem is well known. It shows how even extending $\mathrm{FO}(E_\downarrow, E_\rightarrow)$
with equality tests on data values immediately yields undecidability.

Theorem 3.1. (See, for example, [Bojanczyk et al. 2011; Fan and Libkin 2002;
Neven et al. 2004]) The satisfaction problem for the logic $\mathrm{FO}(E_\downarrow, E_\rightarrow, \sim)$ is
undecidable.

One of the deepest results in this area is the following decidability result for the
logic $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$.

Theorem 3.2. [Bojanczyk et al. 2009] The satisfaction problem for the logic
$\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$ is decidable.

3.3 A few examples 

In this subsection we present a few examples of properties of ordered-data trees 
expressible in our logic. Some of the examples are special cases of more general 
techniques that will be used later on. 

Example 1. Let $\Sigma = \{a, b\}$. Consider the language $\mathcal{L}_{data}$ of ordered-data trees
over $\Sigma$ where an ordered-data tree $t \in \mathcal{L}_{data}$ if and only if there exist two $a$-nodes $u$
and $v$ such that $u$ is an ancestor of $v$ and either $v \sim u$ or $v \prec u$. This language can
be expressed with the formula $\exists X \exists Y \exists Z \, \varphi$, where $\varphi$ states that $Y$ contains only
the node $u$, $Z$ contains only the node $v$, $X$ contains precisely the nodes in the path
from $u$ to $v$, and $v \sim u$ or $v \prec u$. Formally, the formula $\varphi$ is the conjunction of the
following.

— $|Y| = |Z| = 1$ and $Y \subseteq X$ and $Z \subseteq X$,
— $\forall x(Y(x) \to a(x))$,
— $\forall x(Z(x) \to a(x))$,
— $\forall x(Y(x) \to \forall y(X(y) \to \neg E_\downarrow(y,x) \wedge \neg E_\rightarrow(y,x) \wedge \neg E_\rightarrow(x,y)))$,
— $\forall x(Z(x) \to \forall y(X(y) \to \neg E_\downarrow(x,y)))$,
— $\forall x(\neg Y(x) \wedge X(x) \to \exists y(X(y) \wedge E_\downarrow(y,x)))$,
— $\forall x \forall y(Y(x) \wedge Z(y) \to (y \sim x \vee y \prec x))$,

where $|Y| = 1$ stands for $\exists x(Y(x) \wedge \forall y(Y(y) \to y = x))$, and $Y \subseteq X$ for $\forall x(Y(x) \to
X(x))$.

‡To avoid confusion, we put the subscript data on $\mathcal{L}_{data}$ to denote a language of ordered-data trees.
We use the symbol $\mathcal{L}$ without the subscript data to denote the usual language of trees/strings
without data.

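Example 2. For a fixed set $S \subseteq \Sigma$ and an integer $m \geq 1$, we consider the language
$\mathcal{L}^{S,m}_{data}$ such that $t \in \mathcal{L}^{S,m}_{data}$ if and only if $|[S]_t| = m$.

The language $\mathcal{L}^{S,m}_{data}$ can be expressed in $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$ by the formula $\varphi$ of
the form:

\[ \exists X_1 \cdots \exists X_m \; \varphi_1 \wedge \varphi_2 \wedge \varphi_3 \wedge \varphi_4 \wedge \varphi_5 \]

where $\varphi_1, \ldots, \varphi_5$ are as follows. We pick an arbitrary symbol $a \in S$.

— $\varphi_1$ states that the predicates $X_1, \ldots, X_m$ are disjoint and each of them contains
exactly one $a$-node. Formally,

\[ \varphi_1 := \bigwedge_{1 \leq i \neq j \leq m} X_i \cap X_j = \emptyset \;\wedge\; \bigwedge_{1 \leq i \leq m} |X_i| = 1 \;\wedge\; \bigwedge_{1 \leq i \leq m} \forall x(X_i(x) \to a(x)) \]

— $\varphi_2$ states that the data values found in $X_1, \ldots, X_m$ are all different. Formally,

\[ \varphi_2 := \bigwedge_{1 \leq i < j \leq m} \forall x \forall y(X_i(x) \wedge X_j(y) \to x \nsim y) \]

— $\varphi_3$ states that for each $i \in \{1, \ldots, m\}$, if a data value is found in a node in $X_i$, then
it must be found in some $b$-node, for every $b \in S$. Formally,

\[ \varphi_3 := \bigwedge_{1 \leq i \leq m} \forall x\Big(X_i(x) \to \bigwedge_{b \in S,\, b \neq a} \exists y(b(y) \wedge x \sim y)\Big) \]

— $\varphi_4$ states that for each $i \in \{1, \ldots, m\}$, if a data value is found in a node in $X_i$, then
it must not be found in any $b$-node, for every $b \notin S$. Formally,

\[ \varphi_4 := \bigwedge_{1 \leq i \leq m} \forall x \forall y\Big(\big(X_i(x) \wedge \bigvee_{b \notin S} b(y)\big) \to x \nsim y\Big) \]

— $\varphi_5$ states that every $a$-node (recall that $a \in S$) that does not belong to the
$X_i$'s either has the same data value as one of the nodes in the $X_i$'s, or has
a data value not in $[S]_t$. That its data value does not belong to $[S]_t$ can be
stated as the negation of

— for each $b \in S$, there is a $b$-node with the same data value; and
— the data value cannot be found in any $b$-node, for every $b \notin S$.

Formally,

\[ \varphi_5 := \forall x\Big(a(x) \wedge \bigwedge_{1 \leq i \leq m} \neg X_i(x) \to \psi_1(x) \vee \neg\psi_2(x)\Big) \]

\[ \psi_1(x) := \exists y\Big(\bigvee_{1 \leq i \leq m} X_i(y) \wedge x \sim y\Big) \]

\[ \psi_2(x) := \bigwedge_{b \in S,\, b \neq a} \exists y(b(y) \wedge x \sim y) \;\wedge\; \bigwedge_{b \notin S} \forall y(b(y) \to x \nsim y) \]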

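Example 3. For a fixed set $S \subseteq \Sigma$ and an integer $m \geq 1$, we consider the language
$\mathcal{L}^{S, \mathrm{mod}\, m}_{data}$ such that $t \in \mathcal{L}^{S, \mathrm{mod}\, m}_{data}$ if and only if $|[S]_t| \equiv 0 \pmod{m}$.
This language $\mathcal{L}^{S, \mathrm{mod}\, m}_{data}$ can be expressed in $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$ by the formula
$\varphi$ of the form:

\[ \exists X_0 \cdots \exists X_{m-1} \, \exists Y_0 \cdots \exists Y_{m-1} \, \exists Z \; \varphi_1 \wedge \cdots \wedge \varphi_9, \]

where $\varphi_1, \ldots, \varphi_9$ are as follows. We pick an arbitrary symbol $a \in S$.

— $\varphi_1$ states that the root node must be in $X_0$ and $Y_0$. Formally,

\[ \varphi_1 := \forall x(\mathrm{root}(x) \to X_0(x) \wedge Y_0(x)) \]

where root($x$) stands for $\forall y(\neg E_\downarrow(y, x))$.

— $\varphi_2$ states that every node must belong to one of the $X_i$'s and one of the $Y_i$'s. Formally,

\[ \varphi_2 := \forall x\Big(\bigvee_i X_i(x)\Big) \wedge \forall x\Big(\bigvee_i Y_i(x)\Big) \]

— $\varphi_3$ states that the predicate $Z$ contains only $a$-nodes. Formally,

\[ \varphi_3 := \forall x(Z(x) \to a(x)) \]

— $\varphi_4$ states that all nodes in $Z$ contain different data values. Formally,

\[ \varphi_4 := \forall x \forall y(Z(x) \wedge Z(y) \wedge x \sim y \to x = y) \]

— $\varphi_5$ states that the data values found in nodes in $Z$ are precisely $[S]_t$. Formally,

\[ \varphi_5 := \forall x\Big(Z(x) \to \bigwedge_{b \in S,\, b \neq a} \exists y(b(y) \wedge x \sim y)\Big) \wedge \forall x \forall y\Big(\big(Z(x) \wedge \bigvee_{b \notin S} b(y)\big) \to x \nsim y\Big) \wedge \forall x\big(a(x) \wedge \neg Z(x) \to \psi_1(x) \vee \neg\psi_2(x)\big) \]

\[ \psi_1(x) := \exists y(Z(y) \wedge x \sim y) \]

\[ \psi_2(x) := \bigwedge_{b \in S,\, b \neq a} \exists y(b(y) \wedge x \sim y) \;\wedge\; \bigwedge_{b \notin S} \forall y(b(y) \to x \nsim y) \]

— $\varphi_6, \varphi_7, \varphi_8, \varphi_9$ express the intended meaning of the $X_i$'s and $Y_i$'s. Formally,

\[ \varphi_6 := \forall x\big(\text{last-sibling}(x) \to \forall y(E_\downarrow(y, x) \to \psi_1(x,y) \wedge \psi_2(x,y))\big) \]

\[ \psi_1(x,y) := \bigwedge_i X_i(y) \wedge Z(y) \to Y_{i-1 \bmod m}(x) \]

\[ \psi_2(x,y) := \bigwedge_i X_i(y) \wedge \neg Z(y) \to Y_i(x) \]

\[ \varphi_7 := \forall x\Big(\bigwedge_i \big(\text{first-sibling}(x) \wedge X_i(x) \to Y_i(x)\big)\Big) \]

\[ \varphi_8 := \forall x \forall y\Big(E_\rightarrow(x, y) \to \bigwedge_{0 \leq i, j < m} Y_i(x) \wedge X_j(y) \to Y_{i+j \bmod m}(y)\Big) \]

\[ \varphi_9 := \forall x(\mathrm{leaf}(x) \wedge Z(x) \to X_1(x)) \wedge \forall x(\mathrm{leaf}(x) \wedge \neg Z(x) \to X_0(x)) \]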
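
The intended meaning of the predicates $X_i$ and $Y_i$ can be read as a bottom-up counting procedure, sketched below in Python. This is our own reading of the conditions, stated as an assumption: $X_i(x)$ says that the number of $Z$-nodes in the subtree of $x$ is $i$ modulo $m$, and $Y_i(x)$ says the same for the subtrees of $x$ and all its left siblings taken together. The function name `mod_counts` and the representation of nodes as tuples are hypothetical.

```python
def mod_counts(nodes, Z, m):
    """Return dicts X, Y giving, for every node, the index i of X_i and Y_i.

    `nodes` is a tree domain (set of int tuples), `Z` a subset of it.
    """
    X, Y = {}, {}
    # Deepest nodes first, so children are handled before their parent;
    # within one depth, tuple order puts left siblings before right ones.
    for u in sorted(nodes, key=lambda n: (-len(n), n)):
        k = 0
        while u + (k,) in nodes:  # count the children of u
            k += 1
        # The Y-index of the last child sums up all the children's subtrees.
        below = Y[u + (k - 1,)] if k else 0
        X[u] = (below + (u in Z)) % m
        if u and u[-1] > 0:
            left = u[:-1] + (u[-1] - 1,)
            Y[u] = (Y[left] + X[u]) % m
        else:
            Y[u] = X[u]  # first sibling (or root)
    return X, Y

nodes = {(), (0,), (1,), (1, 0)}
Z = {(), (0,), (1, 0)}
X, Y = mod_counts(nodes, Z, m=2)
print(X[()], len(Z) % 2)
```

Under this reading, requiring the root to be in $X_0$ amounts to requiring that the number of nodes in $Z$, and hence $|[S]_t|$, is a multiple of $m$.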



Example 4. Let $\Sigma = \{a, b\}$. Consider the language $\mathcal{L}_{data}$ of ordered-data trees
over $\Sigma$ where an ordered-data tree $t \in \mathcal{L}_{data}$ if and only if all the $a$-nodes with data
values different from the ones in their parents satisfy the following conditions:

— the data values found in these nodes are all different;

— one of these data values must be the largest in the tree $t$.

The language $\mathcal{L}_{data}$ can be expressed in $\exists\mathrm{MSO}^2(E_\downarrow, \sim, \prec)$ with the following
formula:



4. TWO USEFUL LEMMAS 

In this section we prove two lemmas which will be used later on. The first is
combinatorial in nature, and we will use it in our proof of the decidability of the
emptiness problem for ODTA. The second is an Ehrenfeucht-Fraïssé type lemma for
ordered-data trees, and we will use it in our proof of the logical characterization of ODTA.

4.1 A combinatorial lemma 

Let $G$ be an (undirected and finite) graph. For simplicity, we consider only
graphs without self-loops. We denote by $V(G)$ the set of vertices in $G$ and by $E(G)$ the
set of edges. For a node $u \in V(G)$, we write $\deg(u)$ to denote the degree of the
node $u$, and $\deg(G)$ to denote $\max\{\deg(u) \mid u \in V(G)\}$.

A data graph over the alphabet $\Gamma$ is a graph $G$ in which each node carries a label
from $\Gamma$ and a data value from $\mathbb{N}$. A node $u \in V(G)$ is called an $a$-node if its label
is $a$, in which case we write $\mathrm{lab}_G(u) = a$. We denote by $\mathrm{val}_G(u)$ the data value
found in node $u$, and by $V_G(a)$ the set of data values found in $a$-nodes in $G$.

Lemma 4.1. Let $G$ be a data graph over $\Gamma$. Suppose that for each $a \in \Gamma$, we have
$|V_G(a)| > \deg(G)|\Gamma| + \deg(G) + 1$. Then we can reassign the data values in the nodes
in $G$ to obtain another data graph $G'$ such that $V(G) = V(G')$, $E(G) = E(G')$, and

(1) for each $u \in V(G')$, $\mathrm{lab}_G(u) = \mathrm{lab}_{G'}(u)$;

(2) for each $a \in \Gamma$, $V_G(a) = V_{G'}(a)$;

(3) for each $u, v \in V(G)$, if $(u,v) \in E(G')$, then $\mathrm{val}_{G'}(u) \neq \mathrm{val}_{G'}(v)$.

Proof. Note that in the lemma the data graph $G'$ differs from $G$ only in the
data values on the nodes, where we require that adjacent nodes in $G'$ have different
data values.

In the following we write $\#_G(a)$ to denote the number of $a$-nodes in $G$, and
$K = \deg(G)$. First, we perform a partial reassignment of the data values on
some nodes. For each $a \in \Gamma$, we pick $|V_G(a)|$ many $a$-nodes in $G'$. Then we
assign to these $a$-nodes the data values from $V_G(a)$, so that each picked $a$-node gets
exactly one data value and each data value in $V_G(a)$ is given to exactly one of them.
Such an assignment is possible since obviously $\#_G(a) \geq |V_G(a)|$. If $\#_G(a) > |V_G(a)|$,
then there will be some $a$-nodes in $G'$ that do not have data values. We write
$\mathrm{val}_{G'}(u) = \sharp$ if $u$ does not have a data value. From this step we already obtain that
$V_{G'}(a) = V_G(a)$ for each $a \in \Gamma$.

However, after reassigning the data values in this way, there may exist an edge $(u,v)$
such that $\mathrm{val}_{G'}(u) = \mathrm{val}_{G'}(v)$ and $\mathrm{val}_{G'}(u), \mathrm{val}_{G'}(v) \neq \sharp$. We call such an edge a
conflict edge. We are going to reassign the data values one more time so that there
is no conflict edge.

Suppose there exists an edge $(u,v) \in E$ such that $\mathrm{val}_{G'}(u) = \mathrm{val}_{G'}(v) = d$, and
suppose that $u$ is an $a$-node, for some $a \in \Gamma$. The data value $d$ can only be found
in at most $|\Gamma|$ nodes in $G'$. Since $\deg(G) = K$, the neighbours of those nodes (with
data value $d$) are at most $K|\Gamma|$ nodes. Now, since $|V_G(a)| = |V_{G'}(a)| > K|\Gamma| + K + 1$,
there are at least $K + 1$ many $a$-nodes whose neighbours do not get the data
value $d$. Let $u_1, \ldots, u_m$ be such $a$-nodes, where $m \geq K + 1$. Among these nodes,
there exists $u_i$ such that $\mathrm{val}_{G'}(u_i) \notin \{\mathrm{val}_{G'}(x) \mid (u,x) \in E\}$.

We can then swap the data values on the nodes $u$ and $u_i$, and this results in one
less conflict edge. We repeat this process until there is no conflict edge. Now it is
straightforward that

(1) for each $u \in V$, $\mathrm{lab}_G(u) = \mathrm{lab}_{G'}(u)$;

(2) for each $a \in \Gamma$, $V_G(a) = V_{G'}(a)$;

(3) for each $u, v \in V$, if $(u,v) \in E$ and $\mathrm{val}_{G'}(u), \mathrm{val}_{G'}(v) \neq \sharp$, then $\mathrm{val}_{G'}(u) \neq \mathrm{val}_{G'}(v)$.

What is left to do now is to assign data values to the nodes $u$ where $\mathrm{val}_{G'}(u) = \sharp$.
For each $a$-node $u$ where $\mathrm{val}_{G'}(u) = \sharp$, we pick a data value $d \in V_{G'}(a) = V_G(a)$
which is not assigned to any of its neighbours. Such a data value exists since $|V_{G'}(a)| >
K|\Gamma| + K + 1 > K + 1$. Such an assignment does not violate condition (3) above; thus,
we get the desired data graph $G'$. This completes the proof of Lemma 4.1. □
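
The hypothesis and the conclusion of Lemma 4.1 are easy to state as executable checks. The following Python sketch is not the reassignment procedure from the proof; it only tests, for a given simple graph, whether the hypothesis holds and whether a proposed reassignment satisfies conditions (2) and (3). Condition (1) holds by construction, because both assignments share the same label map. The dictionary-based encoding and all function names are assumptions made for this illustration.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Vertex = Hashable
Edge = Tuple[Vertex, Vertex]


def values_by_label(labels: Dict[Vertex, str],
                    values: Dict[Vertex, int]) -> Dict[str, Set[int]]:
    """Map each label a to V_G(a), the set of data values found in a-nodes."""
    result: Dict[str, Set[int]] = {}
    for u, a in labels.items():
        result.setdefault(a, set()).add(values[u])
    return result


def hypothesis_holds(labels: Dict[Vertex, str], edges: Iterable[Edge],
                     values: Dict[Vertex, int]) -> bool:
    """Check |V_G(a)| > deg(G)|Gamma| + deg(G) + 1 for every label a (simple graph)."""
    degree = {u: 0 for u in labels}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    k = max(degree.values(), default=0)
    bound = k * len(set(labels.values())) + k + 1
    return all(len(vals) > bound
               for vals in values_by_label(labels, values).values())


def conclusion_holds(labels: Dict[Vertex, str], edges: Iterable[Edge],
                     old_values: Dict[Vertex, int],
                     new_values: Dict[Vertex, int]) -> bool:
    """Conditions (2) and (3): same value sets per label, no monochromatic edge."""
    same_sets = (values_by_label(labels, old_values)
                 == values_by_label(labels, new_values))
    proper = all(new_values[u] != new_values[v] for u, v in edges)
    return same_sets and proper
```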

4.2 An Ehrenfeucht-Fraïssé type lemma

We need the following notation. A $k$-characteristic function on the alphabet $\Gamma$ is
a function $f : \Gamma \to \{0, 1, 2, \ldots, k\}$. Let $\mathcal{F}_{\Gamma,k}$ be the set of all such $k$-characteristic
functions on $\Gamma$. A function $f \in \mathcal{F}_{\Gamma,k}$ is a $k$-characteristic function for a set $S \subseteq \Gamma$ if
$f(a) \in \{1, 2, \ldots, k\}$ for all $a \in S$, and $f(a) = 0$ for all $a \notin S$.

Let $t$ be an ordered-data tree and let $d_1 < \cdots < d_m$ be the data values found in
$t$. The $k$-extended representation of $t$ is the string $V^k_\Gamma(t) = (S_1, f_1) \cdots (S_m, f_m) \in
(2^\Gamma \times \mathcal{F}_{\Gamma,k})^*$ such that $S_1 \cdots S_m = V_\Gamma(t)$ and for each $i \in \{1, 2, \ldots, m\}$ and for each
$a \in \Gamma$,

(1) $f_i$ is a $k$-characteristic function for the set $S_i$,

(2) if $1 \leq f_i(a) \leq k - 1$, then there are exactly $f_i(a)$ many $a$-nodes in $t$ with data
value $d_i$,

(3) if $f_i(a) = k$, then there are at least $k$ many $a$-nodes in $t$ with data value
$d_i$.
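
As a concrete illustration of this definition, the following Python sketch computes the $k$-extended representation from the multiset of (label, data value) pairs of a tree; the tree shape plays no role here. It assumes $k \geq 1$ and integer data values, and it stores each $f_i$ as a dictionary restricted to $S_i$ (labels outside $S_i$ implicitly map to 0). The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple

LabelledValue = Tuple[str, int]          # (label, data value) of one node


def extended_representation(
        nodes: List[LabelledValue],
        k: int) -> List[Tuple[FrozenSet[str], Dict[str, int]]]:
    """Return [(S_1, f_1), ..., (S_m, f_m)], ordered by increasing data value."""
    counts: Dict[int, Counter] = {}
    for label, value in nodes:
        counts.setdefault(value, Counter())[label] += 1
    representation = []
    for value in sorted(counts):
        # f_i(a) is the number of a-nodes with this data value, capped at k
        f = {a: min(c, k) for a, c in counts[value].items()}
        representation.append((frozenset(f), f))
    return representation


def string_representation(nodes: List[LabelledValue]) -> List[FrozenSet[str]]:
    """Return S_1 ... S_m, i.e. the representation without the counting part."""
    return [S for S, _ in extended_representation(nodes, 1)]
```

For example, three $a$-nodes and one $b$-node with data value 5, together with one $b$-node with data value 9, yield for $k = 2$ the string $(\{a,b\}, f_1)(\{b\}, f_2)$ with $f_1(a) = 2$, $f_1(b) = 1$ and $f_2(b) = 1$.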

We assume that in every formula in $\mathrm{MSO}(\sim, \prec, \prec_{suc})$ all the monadic second-order
quantifiers precede the first-order part. That is, sentences in $\mathrm{MSO}(\sim, \prec, \prec_{suc})$
are of the form $\varphi := Q_1 X_1 \cdots Q_s X_s\ \psi$, where the $X_i$'s are monadic second-order
variables, the $Q_i$'s are $\exists$ or $\forall$, and $\psi \in \mathrm{FO}(\sim, \prec, \prec_{suc})$ extended with the unary
predicates $X_1, \ldots, X_s$. We call the integer $s$ the MSO quantifier rank of $\varphi$, denoted
by $\text{MSO-qr}(\varphi) = s$, while we write $\text{FO-qr}(\varphi)$ to denote the quantifier rank of $\psi$, that
is, the quantifier rank of the first-order part of $\varphi$.

Lemma 4.2. Let $t_1$ and $t_2$ be ordered-data trees over $\Gamma$ such that $V^{k2^s}_\Gamma(t_1) =
V^{k2^s}_\Gamma(t_2)$. For any $\mathrm{MSO}(\sim, \prec, \prec_{suc})$ sentence $\varphi$ such that $\text{MSO-qr}(\varphi) \leq s$ and
$\text{FO-qr}(\varphi) \leq k$, $t_1 \models \varphi$ if and only if $t_2 \models \varphi$.

Proof. The proof is by an Ehrenfeucht-Fraïssé game for MSO of $(s + k)$ rounds,
with $s$ rounds of set-moves and $k$ rounds of point-moves. We can assume that
the set-moves precede the point-moves. See, for example, [Libkin 2004], for the
definition of Ehrenfeucht-Fraïssé games.

Before we go to the proof, we need some notation. Let $t_1$ and $t_2$ be ordered-data
trees over $\Gamma$. For $(a,d) \in \Gamma \times \mathbb{N}$, we write $P_{t_1}(a,d) = \{u \mid \mathrm{lab}_{t_1}(u) =
a \text{ and } \mathrm{val}_{t_1}(u) = d\}$ for the set of nodes in $t_1$ with label $a$ and data value $d$. We
define $P_{t_2}(a,d)$ similarly for $t_2$.

Let $O \subseteq \{\sim, \prec, \prec_{suc}\}$. Let $u_1, \ldots, u_k \in \mathrm{Dom}(t_1)$ and $v_1, \ldots, v_k \in \mathrm{Dom}(t_2)$, for
some ordered-data trees $t_1$ and $t_2$. The mapping $(u_1, \ldots, u_k) \mapsto (v_1, \ldots, v_k)$ is a
partial $O$-isomorphism (with equality) from $t_1$ to $t_2$ if it is a partial isomorphism
with regard to the vocabulary $O$, and if $u_i = u_{i'}$, then $v_i = v_{i'}$.

We are going to describe Duplicator's strategy for winning the Ehrenfeucht-Fraïssé
game for MSO of $s$ rounds of set-moves, followed by $k$ rounds of
point-moves. We start with the set-moves.

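Duplicator's strategy for set-moves: Suppose that the game has already been played for $l$
rounds, where $X_1, \ldots, X_l$ and $Y_1, \ldots, Y_l$ are the sets of positions chosen in $t_1$ and
$t_2$, respectively. For each $I \subseteq \{1, \ldots, l\}$, define the following sets:

$P_{t_1}(a,d;I) = P_{t_1}(a,d) \cap \bigcap_{i \in I} X_i \cap \bigcap_{j \notin I} \overline{X_j}$

$P_{t_2}(a,d;I) = P_{t_2}(a,d) \cap \bigcap_{i \in I} Y_i \cap \bigcap_{j \notin I} \overline{Y_j}$

Duplicator's strategy is to preserve the following identity, where $m$ denotes the number
of rounds of set-moves (that is, $m = s$): for every $(a,d) \in \Gamma \times \mathbb{N}$
and every $I \subseteq \{1, \ldots, l\}$,

— If the cardinality $|P_{t_1}(a,d;I)| < k2^{m-l}$, then $|P_{t_2}(a,d;I)| = |P_{t_1}(a,d;I)|$.
— If the cardinality $|P_{t_1}(a,d;I)| \geq k2^{m-l}$, then also $|P_{t_2}(a,d;I)| \geq k2^{m-l}$.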

Now suppose that on the $(l+1)$-th set-move, Spoiler chooses a set $X$ of positions on
$t_1$. Duplicator chooses a set $Y$ of positions on $t_2$ as follows. For each $(a,d)$ and each
$I \subseteq \{1, \ldots, l\}$, there are four cases:

(1) $|P_{t_1}(a,d;I) \cap X| < k2^{m-l-1}$ and $|P_{t_1}(a,d;I) \cap \overline{X}| < k2^{m-l-1}$.

Then $|P_{t_1}(a,d;I)| < k2^{m-l}$, which by the induction hypothesis implies $|P_{t_2}(a,d;I)| =
|P_{t_1}(a,d;I)|$. Duplicator picks $|P_{t_1}(a,d;I) \cap X|$ many points from $P_{t_2}(a,d;I)$
and declares them "belong to $Y$." The rest of the points from $P_{t_2}(a,d;I)$ are
declared "not belong to $Y$."

Obviously, $|P_{t_1}(a,d;I) \cap X| = |P_{t_2}(a,d;I) \cap Y| < k2^{m-l-1}$ and $|P_{t_1}(a,d;I) \cap
\overline{X}| = |P_{t_2}(a,d;I) \cap \overline{Y}| < k2^{m-l-1}$.

(2) $|P_{t_1}(a,d;I) \cap X| < k2^{m-l-1}$ and $|P_{t_1}(a,d;I) \cap \overline{X}| \geq k2^{m-l-1}$.

In this case, either $|P_{t_1}(a,d;I)| < k2^{m-l}$ or $|P_{t_1}(a,d;I)| \geq k2^{m-l}$. In either case there are
$|P_{t_1}(a,d;I) \cap X|$ many points from $P_{t_2}(a,d;I)$ which Duplicator declares
as "belong to $Y$." The rest of the points from $P_{t_2}(a,d;I)$ are declared "not
belong to $Y$."

Obviously, $|P_{t_1}(a,d;I) \cap X| = |P_{t_2}(a,d;I) \cap Y|$ and $|P_{t_2}(a,d;I) \cap \overline{Y}| \geq k2^{m-l-1}$.

(3) $|P_{t_1}(a,d;I) \cap X| \geq k2^{m-l-1}$ and $|P_{t_1}(a,d;I) \cap \overline{X}| < k2^{m-l-1}$.
Similar to Case 2.

(4) $|P_{t_1}(a,d;I) \cap X| \geq k2^{m-l-1}$ and $|P_{t_1}(a,d;I) \cap \overline{X}| \geq k2^{m-l-1}$.

Then $|P_{t_1}(a,d;I)| \geq k2^{m-l}$, and so $|P_{t_2}(a,d;I)| \geq k2^{m-l}$. Duplicator declares
half of the points in $P_{t_2}(a,d;I)$ as "belong to $Y$" and the other half as "not
belong to $Y$."
Obviously, $|P_{t_2}(a,d;I) \cap Y| \geq k2^{m-l-1}$ and $|P_{t_2}(a,d;I) \cap \overline{Y}| \geq k2^{m-l-1}$.

Now after $m$ rounds of set-moves, we have the following identity: for every $(a,d) \in
\Gamma \times \mathbb{N}$ and every $I \subseteq \{1, \ldots, m\}$,

— If the cardinality $|P_{t_1}(a,d;I)| < k$, then $|P_{t_1}(a,d;I)| = |P_{t_2}(a,d;I)|$.
— If the cardinality $|P_{t_1}(a,d;I)| \geq k$, then also $|P_{t_2}(a,d;I)| \geq k$.

This ends our description of Duplicator's strategy for set-moves. Now we describe
Duplicator's strategy for point-moves.

Duplicator's strategy for point-moves: Suppose that the game is now on the $l$-th step.
Let $(u_1, \ldots, u_l) \mapsto (v_1, \ldots, v_l)$ be a partial $\{\sim, \prec, \prec_{suc}\}$-isomorphism, where $0 \leq
l \leq k - 1$. Suppose Spoiler chooses a node $u_{l+1}$ on $t_1$ such that $\mathrm{val}_{t_1}(u_{l+1})$ is the
$j$-th largest data value in $t_1$.

— If $u_{l+1} = u_{i'}$ for some $i' \in \{1, \ldots, l\}$, Duplicator chooses $v_{l+1} = v_{i'}$ on $t_2$.

— If $u_{l+1} \notin \{u_1, \ldots, u_l\}$, Duplicator chooses $v_{l+1}$ on $t_2$ such that $v_{l+1} \notin \{v_1, \ldots, v_l\}$
and $\mathrm{lab}_{t_1}(u_{l+1}) = \mathrm{lab}_{t_2}(v_{l+1})$ and $\mathrm{val}_{t_2}(v_{l+1})$ is the $j$-th largest data value in $t_2$.

Such a node exists, as $V^k_\Gamma(t_1) = V^k_\Gamma(t_2)$.

In either case, $(u_1, \ldots, u_{l+1}) \mapsto (v_1, \ldots, v_{l+1})$ is a partial $\{\sim, \prec, \prec_{suc}\}$-isomorphism.
This completes the description of Duplicator's strategy and hence our proof. □

5. AUTOMATA FOR ORDERED-DATA TREE 

In this section we introduce an automata model for ordered-data trees
and study its expressive power.

Definition 5.1. An ordered-data tree automaton, in short ODTA, over the alphabet
$\Sigma$ is a triplet $\mathcal{S} = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$, where $\mathcal{T}$ is a letter-to-letter transducer from
$\Sigma \times \{\top, \bot, *\}^3$ to the output alphabet $\Gamma$; $\mathcal{M}$ is an automaton on strings over the
alphabet $2^\Gamma$; and $\Gamma_0 \subseteq \Gamma$.

An ordered-data tree $t$ is accepted by $\mathcal{S}$, denoted by $t \in \mathcal{L}_{data}(\mathcal{S})$, if there exists
an ordered-data tree $t'$ over $\Gamma$ such that

— on input $\mathrm{Profile}(t)$, the transducer $\mathcal{T}$ outputs $t'$;

— the automaton $\mathcal{M}$ accepts the string $V_\Gamma(t')$; and

— for every $a \in \Gamma_0$, all the $a$-nodes in $t'$ have different data values.

We describe a few examples of ODTA that accept the languages described in Examples
1, 2, 3 and 4.

Example 5. An ODTA $\mathcal{S} = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$ that accepts the language in
Example 1 can be defined as follows. The output alphabet of the transducer $\mathcal{T}$ is
$\Gamma = \{\alpha, \beta, \gamma\}$. On an input tree $t$, the transducer $\mathcal{T}$ marks the nodes in $t$ as
follows. There is exactly one node marked with $\alpha$ and exactly one node marked with $\beta$, and
the $\alpha$-node is an ancestor of the $\beta$-node. The automaton $\mathcal{M}$ accepts all the strings in which
the position labeled with $S \ni \beta$ is less than or equal to the position labeled with
$S' \ni \alpha$. (These two positions can be equal, which means $S = S'$.) Finally, $\Gamma_0 = \emptyset$.

Example 6. An ODTA $\mathcal{S} = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$ that accepts the language in
Example 2 can be defined as follows. The transducer $\mathcal{T}$ is an identity transducer.
The automaton $\mathcal{M}$ accepts all the strings in which the symbol $S$ appears exactly
$m$ times, and $\Gamma_0 = \emptyset$.

Example 7. An ODTA $\mathcal{S} = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$ that accepts the language
$\mathcal{L}^{S\ (\mathrm{mod}\ m)}_{data}$ in Example 3 can be defined as follows. The transducer $\mathcal{T}$ is an identity
transducer. The automaton $\mathcal{M}$ accepts a string if and only if the number of appearances
of the symbol $S$ in it is a multiple of $m$, and $\Gamma_0 = \emptyset$.

Example 8. An ODTA $\mathcal{S} = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$ that accepts the language in
Example 4 can be defined as follows. The output alphabet of the transducer $\mathcal{T}$ is
$\Gamma = \{\alpha, \beta\}$. The transducer $\mathcal{T}$ marks the nodes as follows. A node is marked with
$\alpha$ if and only if it is an $a$-node and its data value differs from the one of its
parent. All the other nodes are marked with $\beta$. The automaton $\mathcal{M}$ accepts a string
$v$ if and only if the last symbol in $v$ contains the symbol $\alpha$, while $\Gamma_0 = \{\alpha\}$.
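
The acceptance conditions of Examples 7 and 8 can be checked directly on a concrete tree. The Python sketch below bypasses the automata machinery and simply evaluates the three conditions of Definition 5.1 for these two examples. It assumes a non-empty tree encoded as a list of (label, data value, parent index) triples, and it assumes that the root, having no parent, is never marked $\alpha$; the encoding and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

TreeNode = Tuple[str, int, Optional[int]]   # (label, data value, parent index)


def accepts_example_7(nodes: List[TreeNode], S: Set[str], m: int) -> bool:
    """Identity transducer; M counts the occurrences of the symbol S modulo m."""
    labels_of: Dict[int, Set[str]] = {}
    for label, value, _ in nodes:
        labels_of.setdefault(value, set()).add(label)
    # The symbol S occurs once for each data value whose label set is exactly S
    occurrences = sum(1 for labels in labels_of.values() if labels == set(S))
    return occurrences % m == 0


def accepts_example_8(nodes: List[TreeNode]) -> bool:
    """Mark alpha/beta, then check the last symbol and the Gamma_0 condition."""
    alpha_values = []
    for label, value, parent in nodes:
        differs = parent is not None and nodes[parent][1] != value
        if label == "a" and differs:
            alpha_values.append(value)          # this node is marked alpha
    largest = max(value for _, value, _ in nodes)
    last_symbol_has_alpha = largest in alpha_values      # condition on M
    all_distinct = len(alpha_values) == len(set(alpha_values))  # Gamma_0 = {alpha}
    return last_symbol_has_alpha and all_distinct
```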

The following proposition states that ODTA languages are closed under union and
intersection, but not under negation. We would like to remark that not being closed
under negation is rather common for decidable models for data trees. Oftentimes
models that are closed under negation have undecidable emptiness/satisfaction
problems.

Proposition 5.2. The class of languages accepted by ODTA is closed under
union and intersection, but not under negation.

Proof. For closure under union and intersection, let $\mathcal{S}_1 = \langle \mathcal{T}_1, \mathcal{M}_1, \Gamma^1_0 \rangle$ and
$\mathcal{S}_2 = \langle \mathcal{T}_2, \mathcal{M}_2, \Gamma^2_0 \rangle$ be ODTA. The union $\mathcal{L}_{data}(\mathcal{S}_1) \cup \mathcal{L}_{data}(\mathcal{S}_2)$ is accepted by an
ODTA which non-deterministically chooses to simulate either $\mathcal{S}_1$ or $\mathcal{S}_2$ on the input
ordered-data tree. The ODTA for the intersection $\mathcal{L}_{data}(\mathcal{S}_1) \cap \mathcal{L}_{data}(\mathcal{S}_2)$ can be
obtained by the standard cross product between $\mathcal{S}_1$ and $\mathcal{S}_2$.

That ODTA languages are not closed under negation follows from the fact that
the negation of the language in Example 5 is not accepted by any ODTA. The proof is
rather straightforward and is thus omitted. □

We should remark that, as discussed in Section 7, extending ODTA with
the complement of languages of the form in Example 5 immediately yields
undecidability.

Theorems 5.3, 5.4 and 5.5 are the main results in this paper. Theorem 5.3 below
provides the ODTA characterisation of the logic $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$, and its proof
can be found in Subsection 5.1.

Theorem 5.3. A language $\mathcal{L}_{data}$ is expressible with an $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$ formula
if and only if it is accepted by an ODTA $\mathcal{S} = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$, where $\mathcal{L}(\mathcal{M})$ is a
commutative language.

Theorem 5.4 provides the logical characterisation of ODTA. The proof can be
found in Subsection 5.2.

Theorem 5.4. A language $\mathcal{L}_{data}$ is accepted by an ODTA if and only if it is
expressible with a formula of the form $\exists X_1 \cdots \exists X_m\ \varphi \wedge \psi$, where $\varphi$ is a formula
from $\mathrm{FO}^2(E_\downarrow, E_\rightarrow, \sim)$, and $\psi$ from $\mathrm{FO}(\sim, \prec, \prec_{suc})$, both extended with the unary
predicates $X_1, \ldots, X_m$.

Finally, we show in Theorem 5.5 that the emptiness problem for ODTA is decidable.
The proof can be found in Subsection 5.3.

The decision procedure in Theorem 5.5 runs in 3-NExpTime, while the decision
procedure for $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$ proposed in [Bojanczyk et al. 2009] also runs in
3-NExpTime. However, we should remark that if we use our algorithm for the
satisfaction problem of $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$ via the translation in Theorem 5.3, the
complexity will jump to 5-NExpTime, since there is a double exponential blow-up
in the translation.

Theorem 5.5. The emptiness problem for ODTA is decidable.
5.1 Proof of Theorem 5.3 

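We will need the following proposition, which states that every $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$
formula can be syntactically rewritten to a normal form.

Proposition 5.6. [Bojanczyk et al. 2009, Proposition 3.8] Every $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$
formula $\Psi$ can be rewritten into a normal form of the form:

$\exists X_1 \cdots \exists X_m\ \varphi,$

where $\varphi$ is a conjunction of formulae of the form:

(N1) $\forall x \forall y\,(\alpha(x) \wedge \delta(x,y) \wedge \xi(x,y) \to \beta(y))$,

(N2) $\forall x\,(\mathrm{root}(x) \to \alpha(x))$,

(N3) $\forall x\,(\text{first-sibling}(x) \to \alpha(x))$,

(N4) $\forall x\,(\text{last-sibling}(x) \to \alpha(x))$,

(N5) $\forall x\,(\mathrm{leaf}(x) \to \alpha(x))$,

(N6) $\forall x \forall y\,(\alpha(x) \wedge \alpha(y) \wedge x \sim y \to x = y)$,

(N7) $\forall x \exists y\,(\alpha(x) \to \beta(y) \wedge x \sim y)$,

where $\alpha(x), \beta(x)$ are conjunctions of some unary predicates and their negations, $\delta(x,y)$
is either $E_\downarrow(x,y)$ or $E_\rightarrow(x,y)$, and $\xi(x,y)$ is either $x \sim y$ or $x \nsim y$.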

We should remark that if $\varphi$ is a conjunction of formulae of the forms (N1)-(N5)
above, then there exists a tree automaton $\mathcal{A}$ over the alphabet $\Sigma \times \{\top, \bot, *\}^3$ such
that for every ordered-data tree $t$,

$t \models \varphi$ if and only if $\mathrm{Profile}(t)$ is accepted by $\mathcal{A}$.
Such a construction is straightforward from classical automata theory. See, for
example, [Thomas 1997].

We divide the proof of Theorem 5.3 into Lemmas 5.7 and 5.8 below.

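Lemma 5.7. For every formula $\Psi \in \exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$, there exists an ODTA
$\mathcal{S}_\Psi = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$ such that $\mathcal{L}_{data}(\Psi) = \mathcal{L}_{data}(\mathcal{S}_\Psi)$ and $\mathcal{L}(\mathcal{M})$ is commutative.

Proof. Applying Proposition 5.6, we can rewrite the formula $\Psi$ in its normal
form. Furthermore, we can rewrite the formula $\Psi$ into the form:

$\exists X_1 \cdots \exists X_m\ \varphi,$

where $\varphi$ is a conjunction of formulas of the form:

(N0') $X_1, \ldots, X_m$ are pairwise disjoint, and $\forall x\,(a(x) \to \alpha'(x))$,

(N1') $\forall x \forall y\,(\alpha'(x) \wedge \delta(x,y) \wedge \xi(x,y) \to \beta'(y))$,

(N2') $\forall x\,(\mathrm{root}(x) \to \alpha'(x))$,

(N3') $\forall x\,(\text{first-sibling}(x) \to \alpha'(x))$,

(N4') $\forall x\,(\text{last-sibling}(x) \to \alpha'(x))$,

(N5') $\forall x\,(\mathrm{leaf}(x) \to \alpha'(x))$,

(N6') $\forall x \forall y\,(\alpha'(x) \wedge \alpha'(y) \wedge x \sim y \to x = y)$,

(N7') $\forall x \exists y\,(\alpha'(x) \to \beta'(y) \wedge x \sim y)$,

where $\alpha'(x), \beta'(x)$ are disjunctions of some of the $X_i$'s, and $\delta(x,y)$ and $\xi(x,y)$ are
the same as above.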

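The ODTA $\mathcal{S}_\Psi = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$ is defined as follows.

— The transducer $\mathcal{T}$ checks whether the formulas (N0')-(N5') are satisfied, with
the output alphabet $\Gamma = \{X_1, \ldots, X_m\}$, where a node is labeled with $X_i$ if and
only if it belongs to $X_i$.

The construction of such a transducer is straightforward and is thus omitted. See, for
example, [Thomas 1997].
— $\Gamma_0$ consists of the $X_i$'s for which there exist $A \subseteq \{X_1, \ldots, X_m\}$ with $X_i \in A$ and a
formula of the form (N6')

$\forall x \forall y\,\bigl(\bigvee_{X \in A} X(x) \wedge \bigvee_{X \in A} X(y) \wedge x \sim y \to x = y\bigr)$

in $\varphi$.

— The automaton $\mathcal{M}$ accepts the language $(2^{\{X_1, \ldots, X_m\}} - (\mathcal{P}_1 \cup \mathcal{P}_2))^*$, where

$\mathcal{P}_1 := \{ S \mid$ there exists a formula $\forall x \exists y\,(\bigvee_{X \in A} X(x) \to \bigvee_{X \in B} X(y) \wedge x \sim y)$
in $\varphi$ such that $S \cap A \neq \emptyset$ but $S \cap B = \emptyset \}$

$\mathcal{P}_2 := \{ S \mid$ there exists a formula $\forall x \forall y\,(\bigvee_{X \in A} X(x) \wedge \bigvee_{X \in A} X(y) \wedge x \sim y \to x = y)$
in $\varphi$ such that $|S \cap A| \geq 2 \}$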

That $\mathcal{L}(\mathcal{M})$ is commutative is trivial. That $\mathcal{S}_\Psi$ accepts precisely the language
$\mathcal{L}_{data}(\Psi)$ can be deduced from the following.
— $\mathcal{T}$ ensures that the formulas (N0')-(N5') are satisfied.

— $\Gamma_0$ contains precisely the symbols $X_i$ where all $X_i$-nodes are supposed to
contain different data values.

— For every ordered-data tree $t$, $t \models \forall x \exists y\,(\bigvee_{X \in A} X(x) \to \bigvee_{X \in B} X(y) \wedge x \sim y)$
if and only if $[S]_t = \emptyset$ for all $S$ such that $S \cap A \neq \emptyset$ but $S \cap B = \emptyset$.

— For every ordered-data tree $t$, $t \models \forall x \forall y\,(\bigvee_{X \in A} X(x) \wedge \bigvee_{X \in A} X(y) \wedge x \sim
y \to x = y)$ if and only if

—$[S]_t = \emptyset$ for all $S$ such that $|S \cap A| \geq 2$; and

— for all $X \in A$, $t \models \forall x \forall y\,(X(x) \wedge X(y) \wedge x \sim y \to x = y)$, which is captured by
the condition imposed by $\Gamma_0$.

This concludes the proof of Lemma 5.7. □

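Lemma 5.8. For every ODTA $\mathcal{S} = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$, where $\mathcal{L}(\mathcal{M})$ is a commutative
language, there exists a formula $\varphi \in \exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$ such that $\mathcal{L}_{data}(\varphi) =
\mathcal{L}_{data}(\mathcal{S})$.

Proof. Let $Q_\mathcal{T} = \{q_0, \ldots, q_n\}$ and $\Gamma = \{\alpha_1, \ldots, \alpha_k\}$ be the set of states and
the output alphabet of the transducer $\mathcal{T}$, respectively. Let $\ell = 2^{|\Gamma|} - 1$.

By Theorem 2.1, $\mathcal{L}(\mathcal{M})$ is a finite union of periodic languages. Let $I$ be the finite
set of $(\ell + 1)$-tuples of $\mathbb{N}^\ell$-vectors such that

$\bigcup_{(u, v_1, \ldots, v_\ell) \in I} \mathcal{L}(u, v_1, \ldots, v_\ell) = \mathcal{L}(\mathcal{M}).$

Let $I = \{(u_1, v_{1,1}, \ldots, v_{1,\ell}), \ldots, (u_n, v_{n,1}, \ldots, v_{n,\ell})\}$ and let $S_1, \ldots, S_\ell$ be the
enumeration of the non-empty subsets of $\Gamma$.

First, for $(u, v_1, \ldots, v_\ell) \in I$, we construct a formula $\Psi_{(u, v_1, \ldots, v_\ell)} \in \exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$
where

$t \in \mathcal{L}_{data}(\Psi_{(u, v_1, \ldots, v_\ell)})$ if and only if there exist $h_1, \ldots, h_\ell \geq 0$ such that
$(|[S_1]_t|, \ldots, |[S_\ell]_t|) = u + h_1 v_1 + \cdots + h_\ell v_\ell$.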

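With a slight abuse of notation, we also denote by $v_i$ the non-zero entry of the vector $v_i$,
and by $u_i$ the $i$-th entry of $u$. The formula $\Psi_{(u, v_1, \ldots, v_\ell)}$ is as follows.

$\exists W_{1,1} \cdots W_{1,u_1} \cdots \exists W_{\ell,1} \cdots W_{\ell,u_\ell}$
$\exists X_{1,0} \cdots X_{1,v_1 - 1}\ \exists Y_{1,0} \cdots Y_{1,v_1 - 1}\ \exists Z_1 \cdots \exists X_{\ell,0} \cdots X_{\ell,v_\ell - 1}\ \exists Y_{\ell,0} \cdots Y_{\ell,v_\ell - 1}\ \exists Z_\ell$
$\bigwedge_i (W_{i,1} \cup \cdots \cup W_{i,u_i}) \cap Z_i = \emptyset$
$\wedge \bigwedge_i \varphi_{S_i, u_i}(W_{i,1}, \ldots, W_{i,u_i})$
$\wedge \bigwedge_i \varphi_{S_i\ (\mathrm{mod}\ v_i)}(X_{i,0}, \ldots, X_{i,v_i - 1}, Y_{i,0}, \ldots, Y_{i,v_i - 1}, Z_i)$

where $\varphi_{S_i, u_i}(W_{i,1}, \ldots, W_{i,u_i})$ and $\varphi_{S_i\ (\mathrm{mod}\ v_i)}(X_{i,0}, \ldots, X_{i,v_i - 1}, Y_{i,0}, \ldots, Y_{i,v_i - 1}, Z_i)$
are the formulas for the languages in Examples 2 and 3,
respectively.
The desired formula $\varphi$ is:

$\exists X_{q_0} \cdots \exists X_{q_n}\ \exists X_{\alpha_1} \cdots \exists X_{\alpha_k}\ \ \varphi_\mathcal{T} \wedge \bigvee_{(u, v_1, \ldots, v_\ell) \in I} \Psi_{(u, v_1, \ldots, v_\ell)}$

where

— the unary predicates $X_{q_0}, \ldots, X_{q_n}, X_{\alpha_1}, \ldots, X_{\alpha_k}$ are supposed to represent the
states and the output alphabet of $\mathcal{T}$, respectively;

— the formula $\varphi_\mathcal{T}$ expresses the behaviour of the transducer $\mathcal{T}$; that is, a tree $t$
satisfies $\varphi_\mathcal{T}$, with $X_{q_i}(u)$ and $X_{\alpha_j}(u)$ holding for every node $u \in \mathrm{Dom}(t)$, if
there exists an accepting run of $\mathcal{T}$ on $t$ in which the node $u$ is labeled with $q_i$
and outputs $\alpha_j$;

— the remaining predicates and the formulas $\Psi_{(u, v_1, \ldots, v_\ell)}$ are as defined above.

This completes the proof of the lemma. □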
5.2 Proof of Theorem 5.4 

In this subsection, for every ordered-data tree $t$, we assume that the data values in $t$
are precisely the natural numbers in the range $[1..m]$, for a positive integer $m \geq 1$.
That is, if $d_1 < d_2 < \cdots < d_m$ are the data values in $t$, then $d_1 = 1$, $d_2 = 2$, ...,
$d_m = m$.

We need the following notion. Let $t$ be an ordered-data tree over the alphabet
$\Gamma$ and let $V^k_\Gamma(t) = (S_1, f_1) \cdots (S_m, f_m)$ be its $k$-extended string representation of
data values in $t$. We define an alternative way of writing $V^k_\Gamma(t)$, denoted by $\tilde{V}^k_\Gamma(t) =
P_1 \cdots P_m$, as follows. For each $i = 1, \ldots, m$,

$P_i = \bigcup_{a \in S_i} \{a\} \times \{1, \ldots, f_i(a)\}.$

That is, the string $\tilde{V}^k_\Gamma(t)$ is over the alphabet $2^{\Gamma \times \{1, \ldots, k\}}$ and for each $a \in S_i$, $P_i$
consists of $f_i(a)$ copies of $a$, where each copy has an "index" from $\{1, \ldots, f_i(a)\}$.
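We define the canonical string for $\tilde{V}^k_\Gamma(t) = P_1 \cdots P_m$ as follows, where $\Gamma = \{a_1, \ldots, a_\ell\}$.

$P_1\ \underbrace{a_1 \cdots a_1}_{f_1(a_1)} \cdots \underbrace{a_\ell \cdots a_\ell}_{f_1(a_\ell)}\ \ \cdots\ \ P_m\ \underbrace{a_1 \cdots a_1}_{f_m(a_1)} \cdots \underbrace{a_\ell \cdots a_\ell}_{f_m(a_\ell)}$

When we view such a canonical string as a tree§, by Lemma 4.2, for every $\mathrm{FO}(\sim, \prec)$
formula $\varphi$,

$t \models \varphi$ if and only if the canonical string of $\tilde{V}^k_\Gamma(t) \models \varphi$.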
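
To make the shape of the canonical string concrete, the following Python sketch builds it from a $k$-extended representation. It assumes the layout in which each set symbol $P_i$ is followed by $f_i(a)$ copies of each letter $a$, taken in a fixed order of the alphabet; this layout is a reading of the definition above rather than a verbatim transcription. Each $f_i$ is given as a dictionary restricted to $S_i$, and the function name is hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple, Union

SetSymbol = FrozenSet[Tuple[str, int]]
Symbol = Union[str, SetSymbol]


def canonical_string(
        representation: List[Tuple[FrozenSet[str], Dict[str, int]]],
        alphabet: List[str]) -> List[Symbol]:
    """Build P_1 a_1^{f_1(a_1)} ... a_l^{f_1(a_l)} ... P_m a_1^{f_m(a_1)} ... a_l^{f_m(a_l)}."""
    out: List[Symbol] = []
    for S, f in representation:
        # P_i consists of f_i(a) indexed copies of each letter a in S_i
        P = frozenset((a, j) for a in S for j in range(1, f[a] + 1))
        out.append(P)
        for a in alphabet:
            out.extend([a] * f.get(a, 0))       # letters outside S_i contribute nothing
    return out
```

In this encoding, two letter positions carry the same data value exactly when no set symbol occurs between them, which is the property used when translating $x \sim y$.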



§That is, a string is a tree in which each node has at most one child. 

ACM Transactions on Computational Logic, Vol. V, No. N, December 2012. 



22 • Tony Tan 



First, we prove the "if" direction of Theorem 5.4. Let $\Psi$ be a formula of the
form:

$\exists X_1 \cdots \exists X_m\ \varphi \wedge \psi,$

where $\varphi$ is a formula from $\mathrm{FO}^2(E_\downarrow, E_\rightarrow, \sim)$ and $\psi$ from $\mathrm{FO}(\sim, \prec)$, both extended with the
unary predicates $X_1, \ldots, X_m$.

By Proposition 5.6, we can rewrite (with additional unary predicates) the formula
$\varphi$ into a conjunction of formulae of the forms (N1)-(N7) as stated in Proposition 5.6.
Then we further rewrite it so that the unary predicates $X_1, \ldots, X_m$ are pairwise
disjoint and the formula $\varphi$ is a conjunction of formulae of the form:

(N0') a formula stating that $X_1, \ldots, X_m$ are pairwise disjoint and that
$\forall x\,(a(x) \to \alpha(x))$,

(N1') $\forall x \forall y\,(\alpha(x) \wedge \delta(x,y) \wedge \xi(x,y) \to \beta(y))$,

(N2') $\forall x\,(\mathrm{root}(x) \to \alpha(x))$,

(N3') $\forall x\,(\text{first-sibling}(x) \to \alpha(x))$,

(N4') $\forall x\,(\text{last-sibling}(x) \to \alpha(x))$,

(N5') $\forall x\,(\mathrm{leaf}(x) \to \alpha(x))$,

(N6') $\forall x \forall y\,(\alpha(x) \wedge \alpha(y) \wedge x \sim y \to x = y)$,

(N7') $\forall x \exists y\,(\alpha(x) \to \beta(y) \wedge x \sim y)$,

where $\alpha(x), \beta(x)$ are disjunctions of some of the unary predicates $X_1, \ldots, X_m$.
We describe the ODTA $\mathcal{S} = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$ for the formula $\Psi$ as follows.

— The output alphabet of $\mathcal{T}$ is $\Gamma = \{X_1, \ldots, X_m\} \times \{1, \ldots, k\}$, where $k = \text{FO-qr}(\psi)$.

— The transducer simulates the formulas (N1')-(N5') above and simultaneously performs
the following. On an input tree $t$ over the alphabet $\Sigma$, on each node $u \in
\mathrm{Dom}(t)$, $\mathcal{T}$ "guesses" an element $(S, f) \in 2^{\{X_1, \ldots, X_m\}} \times \mathcal{F}_{\{X_1, \ldots, X_m\},k}$ and remembers it in
its state when reading $u$, such that
—if $u \in X_i$, then $X_i \in S$,

— if $1 \leq f(X_i) < k$, then there are exactly $f(X_i) - 1$ nodes other than $u$ belonging
to $X_i$,

— if $f(X_i) = k$, then there are at least $f(X_i) - 1$ nodes other than $u$ belonging
to $X_i$,

— $\mathcal{T}$ outputs $(X_i, j)$ for some $j \in \{1, \ldots, k\}$,

— if $j < k$, then there is no other node in which $\mathcal{T}$ outputs $(X_i, j)$,
— there are nodes in which $\mathcal{T}$ outputs $(X_i, j')$ for each $j' < j$.

—$\Gamma_0 = \{X_1, \ldots, X_m\} \times \{1, \ldots, k-1\}$.

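—Let $\mathcal{C}$ be an automaton over the alphabet $\{X_1, \ldots, X_m\} \cup 2^{\{X_1, \ldots, X_m\} \times \{1, \ldots, k\}}$
that checks whether the input string is a canonical string that satisfies the formula
$\psi$.

Formally, the automaton $\mathcal{C}$ expresses the sentence $\tilde{\psi} \in \mathrm{FO}(<)$ (for strings), where
$\tilde{\psi}$ is obtained from $\psi$ by the following procedure.
— If $\psi$ is $Qx\ \xi$, where $Q \in \{\forall, \exists\}$, then $\tilde{\psi}$ is

$Qx\ \bigvee_i X_i(x) \wedge \tilde{\xi}$

(for $Q = \forall$ the conjunction is to be read as an implication; that is, the quantifier is
relativised to the positions labeled with some $X_i$).
— If $\psi$ is $x = y$, then $\tilde{\psi}$ is also $x = y$.

— If $\psi$ is $x \sim y$, then $\tilde{\psi}$ states "there is no position in between $x$ and $y$ labeled
with any symbol from $2^{\{X_1, \ldots, X_m\} \times \{1, \ldots, k\}}$."
— If $\psi$ is $x \prec y$, then $\tilde{\psi}$ states "there is at least one position in between $x$ and $y$
labeled with a symbol from $2^{\{X_1, \ldots, X_m\} \times \{1, \ldots, k\}}$."
— If $\psi$ is $x \prec_{suc} y$, then $\tilde{\psi}$ states "there is exactly one position in between $x$ and $y$
labeled with a symbol from $2^{\{X_1, \ldots, X_m\} \times \{1, \ldots, k\}}$."
It is straightforward to show that the automaton $\mathcal{C}$ accepts the canonical string of
$P_1 \cdots P_m$ if and only if the canonical string of $P_1 \cdots P_m$ satisfies $\psi$.

Then the automaton $\mathcal{M}$ behaves precisely like $\mathcal{C}$, while projecting all the symbols
from $\{X_1, \ldots, X_m\}$ to the empty string.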

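Next, we prove the "only if" direction of Theorem 5.4. Let $\mathcal{L} = \mathcal{L}_{data}(\mathcal{S})$, where
$\mathcal{S} = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$, and

— $Q = \{q_1, \ldots, q_n\}$ be the states of $\mathcal{T}$;

— $P = \{p_1, \ldots, p_m\}$ be the states of $\mathcal{M}$;

— $\Gamma = \{\alpha_1, \ldots, \alpha_\ell\}$ be the output alphabet of $\mathcal{T}$.

We denote by $\Sigma$ the input alphabet of $\mathcal{T}$.
The desired formula for $\mathcal{L}$ is of the form:

$\exists X_{q_1} \cdots \exists X_{q_n}\ \exists X_{\alpha_1} \cdots \exists X_{\alpha_\ell}\ \exists X_{p_1} \cdots \exists X_{p_m}\ \ \Psi_\mathcal{T} \wedge \Psi_\mathcal{M} \wedge \Psi_{\Gamma_0}$

where

— the unary predicates $X_{q_1}, \ldots, X_{q_n}, X_{\alpha_1}, \ldots, X_{\alpha_\ell}, X_{p_1}, \ldots, X_{p_m}$ are supposed to
represent the states and the output alphabet of $\mathcal{T}$, and the states of $\mathcal{M}$, respectively;

— the formula $\Psi_\mathcal{T}$ expresses the behaviour of the transducer $\mathcal{T}$; that is, a tree $t$
satisfies $\Psi_\mathcal{T}$, with $X_{q_i}(u)$ and $X_{\alpha_j}(u)$ holding for every node $u \in \mathrm{Dom}(t)$, if
there exists an accepting run of $\mathcal{T}$ on $t$ in which the node $u$ is labeled with $q_i$
and outputs $\alpha_j$;

— the formula $\Psi_\mathcal{M}$ expresses the behaviour of the automaton $\mathcal{M}$;

— the formula $\Psi_{\Gamma_0}$ expresses the property that for every $\alpha \in \Gamma_0$, all the nodes
belonging to $X_\alpha$ contain different data values, which is

$\bigwedge_{\alpha \in \Gamma_0} \forall x \forall y\,(X_\alpha(x) \wedge X_\alpha(y) \wedge x \sim y \to x = y).$

The construction of the formula $\Psi_\mathcal{T}$ is rather standard and is thus omitted. We will show
the construction of the formula $\Psi_\mathcal{M}$. Let $\varphi_S(x)$ denote the following formula

$\bigvee_{\alpha \in S} X_\alpha(x) \wedge \bigwedge_{\alpha \in S} \exists y\,(X_\alpha(y) \wedge x \sim y) \wedge \bigwedge_{\alpha \notin S} \forall y\,(X_\alpha(y) \to x \nsim y),$

which states that the data value on the node $x$ belongs to $[S]$. The formula $\Psi_\mathcal{M}$
expresses the following properties.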


— That the nodes containing the minimal data value belong to $X_{p_1}$. Formally, it can
be written as follows.

$\forall x\,\bigl(\forall y\,(x \prec y \vee x \sim y) \to X_{p_1}(x)\bigr)$

— That the transitions of $\mathcal{M}$ must be "respected." Formally, it can be written as
follows.



— That the nodes containing the maximal data value belong to one of the final states
of $\mathcal{M}$, denoted by $F$. Formally, it can be written as follows.



5.3 Proof of Theorem 5.5 

The proof of Theorem 5.5 is as follows. First, we prove that for each ODTA
$\mathcal{S}$, if $\mathcal{L}_{data}(\mathcal{S}) \neq \emptyset$, then $\mathcal{L}_{data}(\mathcal{S})$ contains a data tree with the "small model
property" (Lemma 5.9). Then we describe a procedure that, given an ODTA $\mathcal{S}$, checks
whether $\mathcal{L}_{data}(\mathcal{S})$ contains a data tree with the "small model property," by converting the
ODTA $\mathcal{S}$ into an APC $(\mathcal{A}, \xi)$. Since the emptiness problem for APC is decidable, Theorem 5.5
follows immediately.

We need some terminology. A set of nodes in a data tree $t$ is called connected
if it is connected in the graph induced by $E_\downarrow$ and $E_\rightarrow$. A zone in a data tree $t$ is a
maximal connected set of nodes with the same data value. The outdegree of a zone
$Z$ is the number of different zones to which there is an edge (either $E_\downarrow$ or $E_\rightarrow$) from
$Z$.


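Let $\mathcal{S} = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$ be an ODTA, where $\mathcal{T}$ is a transducer from $\Sigma$ to $\Gamma$. Let $Q$
be the set of states of $\mathcal{T}$. For a tree $t \in \mathcal{L}_{data}(\mathcal{S})$, its extended tree $\tilde{t}$ (with respect
to the ODTA $\mathcal{S}$) is a tree over the alphabet $\Sigma \times \{\top, \bot, *\}^3 \times Q \times \Gamma$, where

— the projection of $\tilde{t}$ to $\Sigma \times \{\top, \bot, *\}^3$ is $\mathrm{Profile}(t)$;

— the projection of $\tilde{t}$ to $Q$ is an accepting run of $\mathcal{T}$ on $t$;

— the projection of $\tilde{t}$ to $\Gamma$ is an output of $\mathcal{T}$ on $t$.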

The following lemma is simply an adaptation of [Bojanczyk et al. 2009, Proposition
3.10] to the case of ODTA. The proof is via cut-and-paste: given an
ordered-data tree $t$ over the alphabet $\Sigma$ where $t$ has "many" zones in which the
outdegree is "large," we can cut some nodes in $t$ and paste them in another part of
$t$ without affecting the sets $V_t(a)$ for each $a \in \Sigma$. The aim of such cut-and-paste
is to reduce the number of zones in $t$ with large outdegree. We give the formal
statement below.

Lemma 5.9. [Compare [Bojanczyk et al. 2009, Proposition 3.10]] For every ODTA
$\mathcal{S} = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$ over the alphabet $\Sigma$, if $\mathcal{L}_{data}(\mathcal{S}) \neq \emptyset$, then there exists a data tree
t € i^data{S) in which there are at most K^'^^ ^ zones with outdegree > K'^^ \ 
where $K = 27 \cdot |Q| \cdot |\Gamma|$, $Q$ is the set of states of $\mathcal{T}$, and $\Gamma$ is the output alphabet
of $\mathcal{T}$.

Proof. Let $\mathcal{S} = \langle \mathcal{T}, \mathcal{M}, \Gamma_0 \rangle$ be an ODTA over the alphabet $\Sigma$, where $Q$ is the set
of states of $\mathcal{T}$ and $\Gamma$ the output alphabet of $\mathcal{T}$. Suppose that $t_0 \in \mathcal{L}_{data}(\mathcal{S})$. We
will work on the extended tree $\tilde{t}_0$ of $t_0$. The aim is to convert $\tilde{t}_0$ into another tree
$\tilde{t}$ over the alphabet $\Sigma \times \{\top, \bot, *\}^3 \times Q \times \Gamma$ such that

(1) the number of zones in t with outdegree > K^^ ) is bounded by K'^^^ \ 

(2) the $\{\top, \bot, *\}^3$ projection of $\tilde{t}$ is the profile of each node,

(3) the $Q$ projection of $\tilde{t}$ is an accepting run of $\mathcal{T}$ on the $\Sigma \times \{\top, \bot, *\}^3$ projection
of $\tilde{t}$, and the output is its $\Gamma$ projection,

(4) for each $(a, (l,p,r), q, b) \in \Sigma \times \{\top, \bot, *\}^3 \times Q \times \Gamma$, the set of data values found
in the $(a, (l,p,r), q, b)$-nodes in $\tilde{t}_0$ is the same as the set of those found in
$(a, (l,p,r), q, b)$-nodes in $\tilde{t}$.

From these, we can conclude that the ordered-data tree obtained from the $\Sigma$ projection
of $\tilde{t}$ is also accepted by $\mathcal{S}$.

Below we give a brief summary of the proof, adapted from the proof of [Bojanczyk
et al. 2009, Proposition 3.10]. We need the following terminology, all of it
from [Bojanczyk et al. 2009].

— Two nodes in a tree are called siblings if they have the same parent node.
— The set of all children of a node is called a sibling group.
— A contiguous sequence of siblings is called an interval.

We write $[u, v]$ for an interval in which $u$ and $v$ are the left-most and right-most
nodes, respectively, in the interval.
— An interval $[u, v]$ is complete if the following holds.

— If a node $u'$ exists such that $E_\rightarrow(u', u)$, then $u' \nsim u$.

— If a node $v'$ exists such that $E_\rightarrow(v, v')$, then $v' \nsim v$.
— An interval is pure if all of its nodes have the same data value.
— A pure interval with the data value $d$ is called a $d$-pure interval.
— If the parent of an interval (or a sibling group) has data value $d$, then it is called
a $d$-parent interval (or a $d$-parent sibling group).
— A zone with the data value $d$ is called a $d$-zone.
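
The zones of a data tree and their outdegrees can be computed by a graph search. The Python sketch below is a minimal illustration under stated assumptions: nodes are numbered from 0, `values[u]` is the data value of node `u`, `children[u]` lists the children of `u` from left to right, and both $E_\downarrow$ and $E_\rightarrow$ are treated as undirected adjacencies when counting the zones reachable from a zone. The encoding and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def zones_and_outdegrees(
        values: List[int],
        children: Dict[int, List[int]]) -> Tuple[List[List[int]], List[int]]:
    """Return the zones (as sorted node lists) and the outdegree of each zone."""
    n = len(values)
    adjacent: List[List[int]] = [[] for _ in range(n)]
    for parent, kids in children.items():
        for i, child in enumerate(kids):
            adjacent[parent].append(child)           # parent-child edge
            adjacent[child].append(parent)
            if i + 1 < len(kids):                    # next-sibling edge
                adjacent[child].append(kids[i + 1])
                adjacent[kids[i + 1]].append(child)

    zone_of = [-1] * n
    zones: List[List[int]] = []
    for start in range(n):
        if zone_of[start] != -1:
            continue
        zone_id = len(zones)
        zone_of[start] = zone_id
        stack, members = [start], []
        while stack:                                 # search within one data value
            u = stack.pop()
            members.append(u)
            for v in adjacent[u]:
                if zone_of[v] == -1 and values[v] == values[start]:
                    zone_of[v] = zone_id
                    stack.append(v)
        zones.append(sorted(members))

    outdegrees = [
        len({zone_of[v] for u in members for v in adjacent[u]} - {zone_id})
        for zone_id, members in enumerate(zones)
    ]
    return zones, outdegrees
```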

The construction of $\tilde{t}$ from $\tilde{t}_0$ is as follows.

(1) Convert $\tilde{t}_0$ to another tree $\tilde{t}_1$ such that

— for every data value $d \in V_{\tilde{t}_1}$ there are at most $O(K)$ complete $d$-pure intervals
of size more than $O(K)$;
—$V_{\tilde{t}_1}(a, (l,p,r), q, b) = V_{\tilde{t}_0}(a, (l,p,r), q, b)$, for every $(a, (l,p,r), q, b) \in \Sigma \times
\{\top, \bot, *\}^3 \times Q \times \Gamma$;
— $\tilde{t}_1$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$.

This step is adapted from [Bojanczyk et al. 2009, Proposition 3.12]. The idea
is to cut an interval (together with its subtree) and paste it in another interval;
while doing so, the data values in the interval remain untouched.

(2) Convert $\tilde{t}_1$ to another tree $\tilde{t}_2$ such that

— for every data value d gV^^ there are at most 0{K) complete d-parent sibling 
group with more than K'^i^) complete pure intervals; 
—$V_{\tilde{t}_2}(a, (l,p,r), q, b) = V_{\tilde{t}_1}(a, (l,p,r), q, b)$, for every $(a, (l,p,r), q, b) \in \Sigma \times
\{\top, \bot, *\}^3 \times Q \times \Gamma$;
— $\tilde{t}_2$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$.

This step is adapted from [Bojanczyk et al. 2009, Proposition 3.14]. Again, when
the cut-and-paste is performed, the data values in the sibling groups remain
untouched.

(3) Convert $t_2$ to another tree $t_3$ such that

— for every data value $d \in V_{t_3}$ there are at most $O(K)$ d-zones containing a
path with more than $O(K)$ nodes;
— $V_{t_3}(a,(l,p,r),q,b) = V_{t_2}(a,(l,p,r),q,b)$, for every $(a,(l,p,r),q,b) \in \Sigma \times \{\top,\bot,*\}^3 \times Q \times \Gamma$;
— $t_3$ is an extended tree of its $\Sigma$-projection w.r.t. $\mathcal{S}$.

This step is adapted from [Bojanczyk et al. 2009, Proposition 3.17]. Again, when
the cut-and-paste is performed, the data values in the zones remain untouched.

(4) Convert $t_3$ to another tree $t_4$ such that

— there are at most iC^^^^) complete pure intervals with more than 0{K'^) 
nodes; 

— $V_{t_4}(a,(l,p,r),q,b) = V_{t_3}(a,(l,p,r),q,b)$, for every $(a,(l,p,r),q,b) \in \Sigma \times \{\top,\bot,*\}^3 \times Q \times \Gamma$;
— $t_4$ is an extended tree of its $\Sigma$-projection w.r.t. $\mathcal{S}$.

This step is adapted from [Bojanczyk et al. 2009, Proposition 3.20]. Here,
when the cut-and-paste is performed, the data values in some zones
have to be changed. However, those changes are only applied to the safe zones,
where a zone is safe if for every node in it there is another node outside the
zone with the same label (from $\Sigma \times \{\top,\bot,*\}^3 \times Q \times \Gamma$) and the same data value.
(See [Bojanczyk et al. 2009, page 23, last paragraph].) More specifically, these
changes are done by applying [Bojanczyk et al. 2009, Lemma 3.19] on the safe
zones.

(5) Convert $t_4$ to another tree $t_5$ such that

— there are at most K^ii^^^ sibling groups containing more than K'^^^^ com- 
plete pure intervals; 

— $V_{t_5}(a,(l,p,r),q,b) = V_{t_4}(a,(l,p,r),q,b)$, for every $(a,(l,p,r),q,b) \in \Sigma \times \{\top,\bot,*\}^3 \times Q \times \Gamma$;
— $t_5$ is an extended tree of its $\Sigma$-projection w.r.t. $\mathcal{S}$.

This step is adapted from [Bojanczyk et al. 2009, Proposition 3.21]. Here there
are also changes of data values when performing cut-and-paste. However, as in
the previous step, they are only applied to the safe zones. These changes are
also done by applying [Bojanczyk et al. 2009, Lemma 3.19] on the safe zones.

(6) Convert $t_5$ to another tree $t_6$ such that

— there are at most K^^^^^ zones containing paths with more than 0{K'^) 
nodes; 

— $V_{t_6}(a,(l,p,r),q,b) = V_{t_5}(a,(l,p,r),q,b)$, for every $(a,(l,p,r),q,b) \in \Sigma \times \{\top,\bot,*\}^3 \times Q \times \Gamma$;
— $t_6$ is an extended tree of its $\Sigma$-projection w.r.t. $\mathcal{S}$.

This step is adapted from [Bojanczyk et al. 2009, Proposition 3.25]. Here there
are also changes of data values when performing cut-and-paste. However, as
in the previous step, they are only applied to the safe zones. More specifically,
these changes are done by applying [Bojanczyk et al. 2009, Lemma 3.24] on the
safe zones.

The extended tree $t_6$ is the desired extended tree. It is a rather straightforward
computation that there are at most K^^^ ' zones in Iq with outdegree > K'<^ ^ . □ 

The reader can immediately notice that those steps are repeated applications of a
"pumping lemma" in both the $E_{\downarrow}$- and $E_{\rightarrow}$-directions. We would also like to remark
that this is the only technique we borrow from [Bojanczyk et al. 2009]. The decision
procedure for ODTA, which we present below, differs significantly from the
procedure in [Bojanczyk et al. 2009]. The latter relies on counting the so-called
dog and sheep symbols (see [Bojanczyk et al. 2009, page 36]), and it seems that it
cannot be generalised to the case of ODTA.
In this paper our decision procedure relies on Lemma 4.1 and a different counting
method. In some sense our decision procedure is a generalisation of the one for
$\exists\mathrm{MSO}^2(+1, \sim)$ over data words presented in [David et al. 2010].
However, the technique in [David et al. 2010] works only for words, not for trees,
and it does not consider ordered data values.

To describe the decision procedure for Theorem 5.5, we need some additional
terminology. For a data tree t over the alphabet $\Gamma$, and $S \subseteq \Gamma$, an S-zone
is a zone in which the labels of the nodes are precisely S. We write $V_t^{zone}(S)$ to
denote the set of data values found in S-zones in t. For $P \subseteq 2^{\Gamma}$,

$$[P]_t^{zone} = \bigcap_{S \in P} V_t^{zone}(S) \;\cap\; \bigcap_{R \notin P} \overline{V_t^{zone}(R)}.$$

Suppose $d_1 < \cdots < d_m$ are all the data values in t. The zonal string representation
of the data values in t, denoted by $V_{\Gamma}^{zone}(t)$, is the string $P_1 \cdots P_m$ over the alphabet
$2^{2^{\Gamma}}$ such that for each $i \in \{1,\ldots,m\}$, $d_i \in [P_i]_t^{zone}$.

A zonal ODTA is $\mathcal{S}' = (\mathcal{T}, \mathcal{M}', \Gamma_0)$, where $\mathcal{T}$ and $\Gamma_0$ are as in the definition
of ODTA, and $\mathcal{M}'$ is a finite state automaton over the alphabet $2^{2^{\Gamma}}$. A data tree t
is accepted by the zonal ODTA $\mathcal{S}'$ if the following holds.

— Profile(t) is accepted by $\mathcal{T}$, yielding an output tree t' over the alphabet $\Gamma$.
— The string $V_{\Gamma}^{zone}(t')$ is accepted by $\mathcal{M}'$.
— For each $a \in \Gamma_0$, all the data values found in the a-nodes in t' are different.

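Proposition 5.10. For every ODTA $\mathcal{S}$, one can construct in ExpTime its
equivalent zonal ODTA.

Proof. Let $\mathcal{S} = (\mathcal{T}, \mathcal{M}, \Gamma_0)$ and $\mathcal{M} = (Q, q_0, \delta, F)$. Its equivalent zonal ODTA
is defined as $\mathcal{S}' = (\mathcal{T}, \mathcal{M}', \Gamma_0)$, where $\mathcal{M}' = (Q, q_0, \delta', F)$ and
$\delta' = \{(q,P,q') \in Q \times 2^{2^{\Gamma}} \times Q \mid \exists (q,S,q') \in \delta \text{ such that } \bigcup_{R \in P} R = S\}$.
It is straightforward to show that $\mathcal{L}_{data}(\mathcal{S}') = \mathcal{L}_{data}(\mathcal{S})$. Note that the only
difference between $\mathcal{S}$ and $\mathcal{S}'$ is the transition relations $\delta$ and $\delta'$ of $\mathcal{M}$ and $\mathcal{M}'$, respectively. □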
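The lifting of transitions in the proof above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: transitions are encoded as triples `(q, S, q2)` with `S` a frozenset of labels, and only non-empty families of non-empty subsets are enumerated, which is a simplification made for this example. The function names are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Return all non-empty subsets of `items` as frozensets."""
    pool = list(items)
    return [frozenset(c) for r in range(1, len(pool) + 1)
            for c in combinations(pool, r)]


def zonal_transitions(delta):
    """Lift each transition (q, S, q2) to all (q, P, q2) with union(P) == S."""
    lifted = set()
    for q, S, q2 in delta:
        # Only subsets of S can occur in a family whose union is S.
        candidates = nonempty_subsets(S)
        for P in nonempty_subsets(candidates):
            if frozenset().union(*P) == frozenset(S):
                lifted.add((q, P, q2))
    return lifted
```

The enumeration is doubly exponential in the size of S, so the sketch is meant for very small alphabets only.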
Briefly, our decision procedure for Theorem 5.5 works as follows. Let $\mathcal{S} = (\mathcal{T},\mathcal{M},\Gamma_0)$
be the given ODTA, where $\Sigma$ is the input alphabet of $\mathcal{T}$, $\Gamma$
the output alphabet, and $Q$ the set of states of $\mathcal{T}$. Let $K = 27 \cdot |\Sigma| \cdot |Q| \cdot |\Gamma|$. The
decision procedure constructs an APC (-4,^) such that S accepts an ordered-data 
tree t in which there are at most K'~"^^ ^ zones with outdegree > K'^^ ^ \{ and only 
if [A, C) accepts the extended tree of t w.r.t. S. 

Its precise description is given as follows. 

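(1) Compute $K = 27 \cdot |\Sigma| \cdot |Q| \cdot |\Gamma|$.

(2) Convert $\mathcal{S}$ into its zonal ODTA $\mathcal{S}' = (\mathcal{T},\mathcal{M}',\Gamma_0)$.

(3) Guess the following items.

(a) A set $\mathcal{P} \subseteq 2^{2^{\Gamma}}$.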

(b) For each P guess an integer Mp < 2 • K^^ ■2^ + 2- K^^ + 1 and a 
set of Mp constants Cp = (ci, . . . , cmp}-^ 

(c) Two integers N, N' such that N' < N < K°(^^) . and a set of N' constants 
V {di, . . .,dN'}- 

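(Note: The constants in $\mathcal{D}$ may overlap with the constants in the $C_P$'s.)

(d) For each $d \in \mathcal{D}$, guess a set $P_d \subseteq 2^{\Gamma}$.

(4) Construct the following automaton $\mathcal{A}$ over the alphabet $\Sigma \times \{\top,\bot,*\}^3 \times Q \times \Gamma$.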

(a) A accepts only the extended trees of C{T) in which there are at most A'' 
zones with outdegree > K*-^ \ 

(b) The automaton $\mathcal{A}$ can remember the constants in its states.

(c) For every $P \in \mathcal{P}$ and every $c \in C_P$, the automaton $\mathcal{A}$ "assigns" the constant
c to an S-zone, for every $S \in P$, but not to any R-zone, for every $R \notin P$.

(d) The automaton A "assigns" every zone with outdegree > K^^ ^ with a 
constant from V. 

(e) For every $d \in \mathcal{D}$, the automaton $\mathcal{A}$ "assigns" the constant
d to an S-zone, for every $S \in P_d$, but not to any R-zone, for every $R \notin P_d$.

(f) For each $a \in \Gamma_0$, there is at most one a-node in every zone, and for every
two zones that contain a-nodes, if they are assigned constants
from the $C_P$'s and $\mathcal{D}$, then these constants must be different.

(g) Every two adjacent zones, if they are assigned constants from the $C_P$'s
and $\mathcal{D}$, must be assigned different constants.

The automaton $\mathcal{A}$ "assigns" a constant to a zone by remembering the constant
in its state while reading the zone.

(5) Let $P_1,\ldots,P_m$ be the enumeration of non-empty subsets of $2^{\Gamma}$.

Applying Lemma 2.3, convert the automaton $\mathcal{M}'$ into its Presburger formula



^Thc purpose of the number 2 ■ ■ 2^ + 2 • is the appHcation of Lemma 4.1 later on, where 

we consider the graph where the nodes are the zones. Each zone is labeled with a symbol from 
2Ex{T,x,*} xQxr^ which is of size 2^. If a zone has outdegree < then it has only at most 

K^^ ) nodes, which means that its degree (the sum of indegree and outdegree) is bounded by 
2 • if . Now V is intended to contain all those P's in which | [P] f | < 2 • K^^ • 2^ + 2 • K^^ + 1 
so that we can "guess" some constants as elements of $[P]_t^{zone}$ and make sure, by the automaton, that
the same constant is not "assigned" to adjacent zones. For those P's not in $\mathcal{P}$, we can apply
Lemma 4.1 to make sure the same data value from $[P]_t^{zone}$ is not assigned to adjacent zones.
$\xi_{\mathcal{M}'}(z_{P_1},\ldots,z_{P_m})$, where the intended meaning of each $z_{P_i}$ is the number of
appearances of the label $P_i$.
(6) Let $\Gamma = \{a_1,\ldots,a_\ell\}$ and let $S_1,\ldots,S_k$ be the enumeration of non-empty subsets
of $\Gamma$. Consider the formula $\xi(x_{a_1},\ldots,x_{a_\ell},x_{S_1},\ldots,x_{S_k})$:

3zp, ■ ■ ■ 3zp^ ^_M>{zp^,. . . ,zp^) (2) 

A f\ zp,= Mp, (3) 

A /\ > 2 • • 2^ + 2 • /<r(-f^') + 1 (4) 

scr P,9S and Pi^P 



A 



oGFo there exists 5 such thai 

a G S and 5 G and Pi ^ V 



A ^p, > IKeP|Pd' =Pi}| (7) 



Pi = and del? 

and Pi 

The meaning of $x_a$ is the number of a-nodes occurring in zones not assigned
any constant from the $C_P$'s and $\mathcal{D}$; and $x_S$ is the number of S-zones not assigned
any constant from the $C_P$'s and $\mathcal{D}$.

(7) Test the emptiness of the APC $(\mathcal{A},\xi)$.

We should remark here that there is a triple exponential blow-up in the size of $\mathcal{A}$,
and a double exponential blow-up in the size of $\xi$. Since the emptiness of APC is
in NP, this yields a 3-NExpTime upper bound for our decision procedure.
The following claim immediately implies the correctness of our algorithm.

Claim 1. {i) For every ordered-data tree t e Cdata{S), in which there are at 
most zones with outdegree > K^^ \ its the extended tree is accepted 

by the APC {A, 0- 

(2) For every $t' \in \mathcal{L}(\mathcal{A},\xi)$, there exists an ordered-data tree $t \in \mathcal{L}_{data}(\mathcal{S})$ such that
$t'$ is an extended tree of t w.r.t. $\mathcal{S}$.

Proof. We prove (1) first. Let $t \in \mathcal{L}_{data}(\mathcal{S})$ be an ordered-data tree in which
there are at most K^^^ ^ zones with outdegree > K^^ \ Let to be the output of 
T on t. 

We have the following items guessed in Step 3 in our algorithm above. 
—V = {P\ |[P]f°"«| < 2 • K^^ ■ 2^ + + 2}. 

—For each P G P, Cp = [P]f°"^ and Mp = \Cp\. 

— be the number of zones in t with outdegree < K^'-'^^ and N' be the number 

of data values found in these zones. 
— V = {d\ d is found in a zone with outdegree > K'-^^^}, 
—For each dGV, Pd is the set such that d e [Pci]f°"^ 
Now let t' be the extended tree of t with respect to $\mathcal{S}$, and let $\mathcal{A}$ and $\xi$ be the
automaton and formula constructed in Steps 4-6 above. We are going to show
that $t' \in \mathcal{L}(\mathcal{A},\xi)$. Obviously, $t' \in \mathcal{L}(\mathcal{A})$. To show that the formula $\xi$ is satisfied,
we take $\mathrm{Parikh}(V_{\Gamma}^{zone}(t_0))$ as witness to $(z_{P_1},\ldots,z_{P_m})$. Since $V_{\Gamma}^{zone}(t_0) \in \mathcal{L}(\mathcal{M}')$,
by Proposition 2.3, the formula $\xi_{\mathcal{M}'}(\mathrm{Parikh}(V_{\Gamma}^{zone}(t_0)))$ holds. It is straightforward
from the definitions of the items $\mathcal{P}$, $M_P$'s, $N$, $N'$, $\mathcal{D}$ and $P_d$'s that the formula $\xi$
in Step 6 is satisfied with the $x_a$'s and $x_S$'s interpreted as intended.

Now we prove (2). The proof is more delicate than the proof of (1). Suppose
$t' \in \mathcal{L}(\mathcal{A},\xi)$.

We are going to construct an ordered-data tree t from t' such that t' is an extended
tree of t w.r.t. $\mathcal{S}$. Let $\mathcal{P}$, $M_P$'s, $C_P$'s, $N$, $N'$, $\mathcal{D}$ and $P_d$'s be the items guessed in
Step 3 above and

— for each $a_i \in \Gamma$, let $n_{a_i}$ be the number of $a_i$-nodes in t' occurring in a zone
without any constants from the $C_P$'s and $\mathcal{D}$;
— for each $S_i \subseteq \Gamma$, let $n_{S_i}$ be the number of $S_i$-zones in t' without any constants
from the $C_P$'s and $\mathcal{D}$.

Suppose $(k_{P_1},\ldots,k_{P_m})$ is the witness to $z_{P_1},\ldots,z_{P_m}$ such that
$\xi(n_{a_1},\ldots,n_{a_\ell},n_{S_1},\ldots,n_{S_k})$ holds.

By Proposition 2.3, this means that there exists a word $w \in \mathcal{L}(\mathcal{M}')$ such that
$\mathrm{Parikh}(w) = (k_{P_1},\ldots,k_{P_m})$. For each $P_i$, we let

$\mathcal{N}_{P_i} = \{j \mid \text{position } j \text{ in } w \text{ is labeled } P_i\}$.

We will assign a data value to each node in t such that

$[P_i]_t^{zone} = \mathcal{N}_{P_i}$,

and $V_{\Gamma}^{zone}(t) = w$. The assignment is done according to the three cases below.

Case 1: For the nodes that are assigned some constants from the $C_{P_i}$'s.
In this case $P_i \in \mathcal{P}$. We define bijections $f_{P_i} : C_{P_i} \to \mathcal{N}_{P_i}$. There is always a
bijection from $C_{P_i}$ to $\mathcal{N}_{P_i}$ since they have the same cardinality $M_{P_i}$, due to the
following condition in the formula $\xi$:

$\bigwedge_{P_i \in \mathcal{P}} z_{P_i} = M_{P_i}$.

The data value assignment to nodes of this case can be done by replacing every
constant $c \in C_{P_i}$ with $f_{P_i}(c)$.

Case 2: For the nodes that are assigned some constants from $\mathcal{D}$.
We define a 1-1 mapping $f : \mathcal{D} \to \{1,\ldots,|w|\}$ such that $f(d) \in \mathcal{N}_{P_d}$, where $P_d$ is
the set guessed in Step 3. Such a 1-1 mapping exists because of the following condition
in the formula $\xi$:

$\bigwedge_{P_i = P_d \text{ and } d \in \mathcal{D} \text{ and } P_i \notin \mathcal{P}} z_{P_i} \ge |\{d' \in \mathcal{D} \mid P_{d'} = P_i\}|$.

The data value assignment to nodes of this case can be done by replacing every
constant $d \in \mathcal{D}$ with $f(d)$.
Case 3: For the nodes that are not assigned any constants from the $C_P$'s and $\mathcal{D}$.
First we assign each such zone in t a data value such that, for each $S \subseteq \Gamma$,

Pi3S and PifV 

This step can be done as follows. The number of such S-zone in t is greater than 
J2pi3S and Pt^v \-^Pi\' '^^^ *° Condition below in the formula ^: 

Pi3S and Pi<^V 

Thus, we can simply assign every iS-zone with a data value from tJpi3S and Pi^V 
and make sure every data value from Up^gs and Pi^v^ Pi appears in some S-zone. 
However, by assigning data values like that, some adjacent zones may get the 
same data values. Here we apply Lemma 4.1. Since for each Pj ^ V, \M p^\ > 
2 • K^^ ■2^ + 2- K'-^^'i + 1, by the condition below in the formula £, 

f\ zp. >2.i^^'.2^ + 2.i^(^') + l, 

Pi^V 

the cardinality 

I IJ ^^p^\= \Afp^>2.K^" .2^ + K(^"^+3. 

Pi3S and PifV Pi3S and Pi^V 

The degree of each of such zone is < 2 • K^^^\ because the outdegree of such zone 

IS < K^^ \ thus so is the number of nodes in the zone, therefore bounds the degree 
of its zone to < 2 • K^^ K Therefore, by applying Lemma 4.1, we can reassign the 
data value in such zone so that each adjacent zone get different data value. 

This completes the proof of our Claim. □ 
6. WEAK ODTA 

A weak ODTA over $\Sigma$ is a triplet $\mathcal{S} = (\mathcal{T},\mathcal{M},\Gamma_0)$ where $\mathcal{T}$ is a letter-to-letter
transducer from $\Sigma$ to the output alphabet $\Gamma$, $\mathcal{M}$ is a finite state automaton over
$2^{\Gamma}$, and $\Gamma_0 \subseteq \Gamma$. An ordered-data tree t is accepted by $\mathcal{S}$, denoted by $t \in \mathcal{L}_{data}(\mathcal{S})$,
if there exists an ordered-data tree t' over $\Gamma$ such that

— on input Proj(t), the transducer $\mathcal{T}$ outputs t';
— the automaton $\mathcal{M}$ accepts the string $V_{\Gamma}(t')$; and
— for every $a \in \Gamma_0$, all the a-nodes in t' have different data values.
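The second and third acceptance conditions depend only on the multiset of (label, data value) pairs of t', not on the tree shape. The Python sketch below illustrates this under assumptions made for the example: the output tree is given as a list of `(label, value)` pairs, the string representation lists, for each data value in increasing order, the set of labels of the nodes carrying it, and the function names are hypothetical.

```python
from collections import defaultdict


def string_representation(nodes):
    """Return the string of label sets, one letter per data value, in increasing order."""
    labels_of = defaultdict(set)
    for label, value in nodes:
        labels_of[value].add(label)
    return [frozenset(labels_of[d]) for d in sorted(labels_of)]


def keys_respected(nodes, gamma0):
    """Check that, for every label in gamma0, all nodes with that label carry distinct values."""
    seen = set()
    for label, value in nodes:
        if label in gamma0:
            if (label, value) in seen:
                return False
            seen.add((label, value))
    return True
```

A full acceptance check would additionally run the transducer on Proj(t) and feed the computed string to the automaton; both are omitted here.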

Note that the only difference between weak ODTA and ODTA is the equality test
on the data values in neighboring nodes, which weak ODTA lack. This difference
accounts for the triple exponential gap in complexity, as stated in the following theorem.

Theorem 6.1. The emptiness problem for weak ODTA is in NP. 



‖ A zone in t can be recognised from the profile information in t'.

Proof. Let $\mathcal{S} = (\mathcal{T},\mathcal{M},\Gamma_0)$ be a weak ODTA. Let $\Sigma$, $Q$, $\Gamma$ be the input alphabet,
set of states and output alphabet of $\mathcal{T}$, respectively.

We need the following notation. For a tree $t \in \mathcal{L}_{data}(\mathcal{S})$, its extended tree $\hat{t}$ (with
respect to the weak ODTA $\mathcal{S}$) is a tree over the alphabet $\Sigma \times Q \times \Gamma$, where

— the projection of $\hat{t}$ to $\Sigma$ is t;
— the projection of $\hat{t}$ to $Q$ is an accepting run of $\mathcal{T}$ on t;
— the projection of $\hat{t}$ to $\Gamma$ is an output of $\mathcal{T}$ on t.

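The decision procedure for Theorem 6.1 works as follows.

(1) Construct an automaton $\mathcal{A}$ over the alphabet $\Sigma \times Q \times \Gamma$ for the extended trees
accepted by $\mathcal{T}$.

(2) Let $\{S_1,\ldots,S_m\} \subseteq 2^{\Gamma}$ be the set of symbols used in $\mathcal{M}$.
By applying Proposition 2.3, construct the Presburger formula $\xi_{\mathcal{M}}(x_{S_1},\ldots,x_{S_m})$
for $\mathcal{M}$.

(3) Let $\Sigma \times Q \times \Gamma = \{(\alpha_1,q_1,a_1),\ldots,(\alpha_k,q_n,a_\ell)\}$. Let $\varphi(x_{(\alpha_1,q_1,a_1)},\ldots,x_{(\alpha_k,q_n,a_\ell)})$
be the formula stating that there exist values for $x_{S_1},\ldots,x_{S_m}$ and $x_{a_1},\ldots,x_{a_\ell}$,
where each $x_{a_i}$ is the sum of the variables $x_{(\alpha,q,a_i)}$, such that

$$\xi_{\mathcal{M}}(x_{S_1},\ldots,x_{S_m}) \;\wedge\; \bigwedge_{a_i \in \Gamma} x_{a_i} \ge \sum_{a_i \in S_j} x_{S_j} \;\wedge\; \bigwedge_{a_i \in \Gamma_0} x_{a_i} = \sum_{a_i \in S_j} x_{S_j}.$$

(4) Test the emptiness of the APC $(\mathcal{A},\varphi)$.

That this procedure works in NP follows directly from the fact that the emptiness
problem of APC is in NP.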

We now show the correctness of our algorithm by showing that $\mathcal{L}_{data}(\mathcal{S}) \neq \emptyset$ if
and only if $\mathcal{L}(\mathcal{A},\varphi) \neq \emptyset$. (For the sake of presentation, we write $\varphi$ without its free
variables.) We start with the "only if" part. Suppose that $t \in \mathcal{L}_{data}(\mathcal{S})$. We claim
that the extended tree $\hat{t}$ of t is accepted by $(\mathcal{A},\varphi)$. Obviously, $\hat{t} \in \mathcal{L}(\mathcal{A})$. To show
that $\varphi(\mathrm{Parikh}(\hat{t}))$ holds, let t' be the $\Gamma$-projection of $\hat{t}$. That is, t' is an output of $\mathcal{T}$
on t.

— As witness to $x_{S_1},\ldots,x_{S_m}$, we take $\mathrm{Parikh}(V(t'))$. Since $V(t') \in \mathcal{L}(\mathcal{M})$, by Proposition 2.3,
$\xi_{\mathcal{M}}(\mathrm{Parikh}(V(t')))$ holds.
— As witness to $x_{a_1},\ldots,x_{a_\ell}$, we take $\mathrm{Parikh}(t')$. Now for each $a_i \in \Gamma$, the constraint
$x_{a_i} \ge \sum_{a_i \in S_j} x_{S_j}$ holds since the number of data values in the $a_i$-nodes cannot
exceed the number of $a_i$-nodes itself. The constraint $x_{a_i} = \sum_{a_i \in S_j} x_{S_j}$ holds for
each $a_i \in \Gamma_0$, since the data values found in $a_i$-nodes are all different.

Thus, $\varphi(\mathrm{Parikh}(\hat{t}))$ holds, and this concludes our proof of the "only if" part.

Now we prove the "if" part. Suppose that $\hat{t} \in \mathcal{L}(\mathcal{A},\varphi)$. So $\hat{t} \in \mathcal{L}(\mathcal{A})$. Let t and
t' be the $\Sigma$- and $\Gamma$-projection of $\hat{t}$, respectively. By the definition of $\mathcal{A}$, t' is an
output of $\mathcal{T}$ on t. Now since $\varphi(\mathrm{Parikh}(\hat{t}))$ holds, in particular there exists a witness
$\bar{M} = (M_1,\ldots,M_m)$ to $x_{S_1},\ldots,x_{S_m}$ such that $\xi_{\mathcal{M}}(\bar{M})$ holds. By Proposition 2.3,
there exists a word $w \in \mathcal{L}(\mathcal{M})$ over the alphabet $2^{\Gamma}$ such that $\mathrm{Parikh}(w) = \bar{M}$.

We are going to assign data values to the nodes of t' (thus, also to those of t) such
that $t \in \mathcal{L}_{data}(\mathcal{S})$. The assignment is done as follows. For each $S \subseteq \Gamma$, let $V_w(S)$ be
the set of positions of w labeled with S. Now for each $a \in \Gamma$, we assign the a-nodes
in t' the data values from $\bigcup_{a \in S} V_w(S)$ such that $V_{t'}(a) = \bigcup_{a \in S} V_w(S)$. This
is possible due to the constraint $x_a \ge \sum_{a \in S} x_S$.

With such an assignment, we get $V(t') = w$. Thus, $V(t') \in \mathcal{L}(\mathcal{M})$. Moreover, for
every $a \in \Gamma_0$, all the data values in a-nodes are different, which follows from the
constraint $x_a = \sum_{a \in S} x_S$. Therefore, the resulting ordered-data tree $t \in \mathcal{L}_{data}(\mathcal{S})$.
This concludes our proof. □
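The data value assignment used in the "if" part can be sketched as follows. This is a minimal illustration under assumptions made for the example: the tree t' is given only by the list of its node labels, the word w is a list of non-empty sets of labels, positions of w (counted from 1) serve as data values, and every label occurring in t' is assumed to need at least one position. The function names are hypothetical.

```python
from collections import defaultdict


def string_representation(nodes):
    """Label sets per data value, listed in increasing order of the values."""
    labels_of = defaultdict(set)
    for label, value in nodes:
        labels_of[value].add(label)
    return [frozenset(labels_of[d]) for d in sorted(labels_of)]


def assign_data_values(labels, word):
    """Give each node a position of `word` as data value so that the string
    representation of the result is `word`; return None if the counts forbid it."""
    positions = defaultdict(list)      # positions[a]: positions whose letter contains a
    for j, letter in enumerate(word, start=1):
        for a in letter:
            positions[a].append(j)
    nodes_of = defaultdict(list)       # nodes_of[a]: indices of the a-nodes
    for index, a in enumerate(labels):
        nodes_of[a].append(index)
    if any(a not in nodes_of for a in positions):
        return None                    # a letter mentions a label with no node
    result = [None] * len(labels)
    for a, indices in nodes_of.items():
        pos = positions.get(a, [])
        if not pos or len(indices) < len(pos):
            return None                # fewer a-nodes than data values required
        for k, index in enumerate(indices):
            # The first len(pos) nodes cover all positions; the rest reuse the last one.
            result[index] = (a, pos[k] if k < len(pos) else pos[-1])
    return result
```

When the number of a-nodes equals the number of positions containing a, the a-nodes receive pairwise distinct values, which mirrors the equality constraint for labels in $\Gamma_0$.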

Next, we give the logical characterisation of weak ODTA. 

Theorem 6.2. A language $\mathcal{L}$ is accepted by a weak ODTA if and only if $\mathcal{L}$ is
expressible with a formula of the form $\exists X_1 \cdots \exists X_m\ \varphi \wedge \psi$, where $\varphi$ is a formula
from $FO^2(E_{\downarrow}, E_{\rightarrow})$, and $\psi$ is a formula from $FO(\sim, \prec_{suc})$, extended with the
unary predicates $X_1,\ldots,X_m$.

The proof of Theorem 6.2 is the same as the proof of Theorem 5.4. The difference
is that to simulate the $FO^2(E_{\downarrow}, E_{\rightarrow})$ formula $\varphi$, the profile information is not
necessary.

6.1 Extending weak ODTA with Presburger constraints 

Like in the case of APC, we can extend weak ODTA with Presburger constraints
without increasing the complexity of its emptiness problem. Let $\mathcal{S} = (\mathcal{T},\mathcal{M},\Gamma_0)$ be
a weak ODTA, where $\Sigma$ and $\Gamma$ are the input and output alphabets of $\mathcal{T}$, respectively.
Let $\Gamma = \{a_1,\ldots,a_\ell\}$.

A weak ODTA $\mathcal{S} = (\mathcal{T},\mathcal{M},\Gamma_0)$ extended with a Presburger constraint is a tuple
$(\mathcal{S},\xi)$, where $\xi(x_1,\ldots,x_\ell,y_1,\ldots,y_{2^\ell-1})$ is an existential Presburger formula with the
free variables $x_1,\ldots,x_\ell,y_1,\ldots,y_{2^\ell-1}$. An ordered-data tree t is accepted by $(\mathcal{S},\xi)$
if there exists an output t' of $\mathcal{T}$ on t such that the automaton $\mathcal{M}$ accepts $V_{\Gamma}(t')$, for each
$a \in \Gamma_0$ all a-nodes in t' have different data values, and $\xi(\mathrm{Parikh}(t'), \mathrm{Parikh}(V_{\Gamma}(t')))$
holds. We write $\mathcal{L}_{data}(\mathcal{S},\xi)$ to denote the language accepted by $(\mathcal{S},\xi)$.

It should be immediate that the emptiness problem of weak ODTA extended
with Presburger constraints is still decidable in NP.

6.2 Comparison with other known decidable formalisms 

We are going to compare the expressiveness of weak ODTA with other known 

models with decidable emptiness. 

6.2.1 DTD with integrity constraints. An XML document is typically viewed 
as a data tree. The most common XML formalism is Document Type Definition 
(DTD). In short, a DTD is a context free grammar and a tree t conforms to a DTD 
D, if it is a derivation tree of a word accepted by the context free grammar. 

The most commonly used XML constraints are integrity constraints which are of 
two types. 

ACM Transactions on Computational Logic, Vol. V, No. N, December 2012. 



34 • Tony Tan 



— The key constraints are constraints of the form:

$\forall x \forall y\,(a(x) \wedge a(y) \wedge x \sim y \rightarrow x = y)$,

denoted by key(a).
— The inclusion constraints are constraints of the form:

$\forall x \exists y\,(a(x) \rightarrow b(y) \wedge x \sim y)$,

denoted by $V(a) \subseteq V(b)$.

The satisfiability problem of a given DTD D and a collection C of integrity
constraints asks whether there exists an ordered-data tree t that conforms to the DTD
and satisfies all the constraints in C. In [Fan and Libkin 2002] it is shown that this
problem is NP-complete.

Theorem 6.3. Given a DTD D and a collection C of integrity constraints, one
can construct a weak ODTA $\mathcal{S}$ such that $\mathcal{L}_{data}(\mathcal{S})$ is precisely the set of ordered-data
trees that conform to D and satisfy all constraints in C.

Proof. Let $\Sigma$ be the alphabet of the given DTD D. Consider the following
weak ODTA $\mathcal{S} = (\mathcal{T},\mathcal{M},\Sigma_0)$.

— $\mathcal{T}$ is an identity transducer that checks whether the input tree conforms to the DTD
D.
— $\mathcal{M}$ is an automaton that accepts $\mathcal{P}^*$, where
$\mathcal{P} = 2^{\Sigma} - \{S \mid a \in S \text{ and } b \notin S \text{ for some } V(a) \subseteq V(b) \in C\}$.
— $\Sigma_0 = \{a \mid key(a) \in C\}$.

That $\mathcal{S}$ is the desired ODTA follows immediately from the fact that for every
ordered-data tree t, $V_t(a) \subseteq V_t(b)$ if and only if $[S]_t = \emptyset$ for all S where $a \in S$ but
$b \notin S$. □
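The alphabet of the automaton in this proof can be computed explicitly for small examples. The Python sketch below is an illustration only: inclusion constraints $V(a) \subseteq V(b)$ are encoded as pairs `(a, b)`, a tree is reduced to its list of `(label, value)` pairs, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def powerset(sigma):
    """All subsets of sigma (including the empty set) as frozensets."""
    pool = sorted(sigma)
    return [frozenset(c) for r in range(len(pool) + 1)
            for c in combinations(pool, r)]


def allowed_symbols(sigma, inclusions):
    """Subsets S with no constraint (a, b) such that a is in S but b is not."""
    return {S for S in powerset(sigma)
            if not any(a in S and b not in S for a, b in inclusions)}


def string_representation(nodes):
    """Label sets per data value, listed in increasing order of the values."""
    labels_of = defaultdict(set)
    for label, value in nodes:
        labels_of[value].add(label)
    return [frozenset(labels_of[d]) for d in sorted(labels_of)]


def satisfies_inclusions(nodes, sigma, inclusions):
    """A tree satisfies the inclusions iff every letter of its string is allowed."""
    allowed = allowed_symbols(sigma, inclusions)
    return all(letter in allowed for letter in string_representation(nodes))
```

The set returned by `allowed_symbols` has size exponential in the alphabet in general, in line with the blow-up discussed after the proof.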

It must be noted that our construction in Theorem 6.3 outputs an automaton $\mathcal{M}$
of exponential size. This blow-up is tight, as the following example shows. Consider
the case where C does not contain inclusion constraints, that is, C contains only
key constraints. Then any equivalent ODTA $\mathcal{S} = (\mathcal{T},\mathcal{M},\Sigma_0)$ will have
$\mathcal{L}(\mathcal{M}) = (2^{\Sigma} - \{\emptyset\})^*$. Thus, we have an exponential blow-up in the size of $\mathcal{M}$.
Nevertheless, if we are concerned only with satisfiability, then we can lower the
complexity to NP, as stated in the following theorem.

Theorem 6.4. Given a DTD D and a collection C of integrity constraints, one
can construct a weak ODTA $\mathcal{S}$ in non-deterministic polynomial time such that
$\mathcal{L}_{data}(\mathcal{S}) \neq \emptyset$ if and only if there exists an ordered-data tree t that conforms to
D and satisfies all the constraints in C.

Proof. Let $\Sigma$ be the alphabet of the DTD D. We non-deterministically construct
a weak ODTA $\mathcal{S} = (\mathcal{T},\mathcal{M},\Sigma_0)$ as follows.

— $\mathcal{T}$ is an identity transducer that checks whether the input tree conforms to the DTD
D.
— Guess an ordering $a_1,\ldots,a_k$ of the elements in $\Sigma$ such that if $V(a_i) \subseteq V(a_j) \in C$,
then $i < j$. Intuitively, this is an ordering of the elements in $\Sigma$ that respects the
inclusion constraints in C.
— Let $S_1,\ldots,S_k \subseteq \Sigma$ be such that $S_i = \Sigma - \{a_1,\ldots,a_{i-1}\}$. Note that $S_1 = \Sigma$.
— $\mathcal{M}$ is a non-deterministic automaton over the alphabet $\{S_1,\ldots,S_k\}$, where the
set of states is $\{q_1,\ldots,q_k\}$, all of $q_1,\ldots,q_k$ are both initial and final states,
and the transitions are $(q_i, S_j, q_j)$ for every $1 \le i \le j \le k$.
— $\Sigma_0 = \{a \mid key(a) \in C\}$.

We claim that $\mathcal{L}_{data}(\mathcal{S}) \neq \emptyset$ if and only if there exists an ordered-data tree t that
conforms to D and satisfies all the constraints in C.

We start with the "if" direction. Suppose t conforms to the DTD D and satisfies
all the constraints in C. For each $a \in \Sigma$, let $N_a$ be the number of data values found
in the a-nodes in t. Let $(a_1,\ldots,a_k)$ be an ordering of the elements of $\Sigma$ such that
$N_{a_i} \le N_{a_j}$ whenever $i < j$.

Consider the following ordered-data tree t' over $\Sigma$, where t' is obtained from t
by reassigning the data values on the nodes in t as follows. For each $a \in \Sigma$, we
assign the set of integers $\{d \mid 1 \le d \le N_a\}$ as the data values of a-nodes in t'. Such an
assignment is possible since $N_a$ is no more than the number of a-nodes in t'. With
such an assignment t' still obeys the constraints in C.

— If $key(a) \in C$, then $N_a$ is precisely the number of a-nodes in t, thus, also in t'.
Thus, with the data values $\{1,\ldots,N_a\}$, the data values on the a-nodes in t' are
all different.
— If $V(a) \subseteq V(a') \in C$, then obviously $N_a \le N_{a'}$. Thus, t' still satisfies the
constraint $V(a) \subseteq V(a')$, since the data values in a-nodes in t' are $\{1,2,\ldots,N_a\}$,
while those in a'-nodes are $\{1,2,\ldots,N_{a'}\}$.

Now the string $V(t') = R_1 \cdots R_m$, where $m = \max_{a \in \Sigma}(N_a)$ and $R_1 \supseteq R_2 \supseteq \cdots \supseteq R_m$,
and is thus accepted by $\mathcal{M}$. That t' is accepted by $\mathcal{T}$ is trivial, and so is the fact that
all the data values found in a-nodes in t' are different for each $a \in \Sigma_0$. Thus, $t' \in \mathcal{L}_{data}(\mathcal{S})$.

For the "only if" direction, it is sufficient to observe that for every ordering
$(a_1,\ldots,a_k)$ that "respects" the inclusion constraints in C, if $V(t) \in \mathcal{L}(\mathcal{M})$, then t
satisfies all the inclusion constraints in C. This completes our proof. □
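The polynomial-size automaton of this proof can be sketched directly from a guessed ordering. The Python code below is an illustration under assumptions made for the example: the ordering is a list of labels, inclusion constraints are pairs `(a, b)` standing for $V(a) \subseteq V(b)$, and acceptance is decided by checking that the indices of the letters read are non-decreasing, which is what the transitions $(q_i, S_j, q_j)$ with $i \le j$ amount to. The function names are hypothetical.

```python
def respects(order, inclusions):
    """Check that a comes strictly before b in `order` for every constraint (a, b)."""
    index = {a: i for i, a in enumerate(order)}
    return all(index[a] < index[b] for a, b in inclusions)


def suffix_symbols(order):
    """S_i is the alphabet minus the first i-1 labels of the ordering."""
    return [frozenset(order[i:]) for i in range(len(order))]


def accepted_by_chain_automaton(word, order):
    """Accept iff every letter is some S_j and the indices j never decrease."""
    index = {s: j for j, s in enumerate(suffix_symbols(order))}
    last = 0
    for letter in word:
        if letter not in index or index[letter] < last:
            return False
        last = index[letter]
    return True
```

Every letter $S_i$ is closed under the guessed inclusions: if a belongs to $S_i$ and $V(a) \subseteq V(b)$ is in C, then b comes later in the ordering and therefore also belongs to $S_i$.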

6.2.2 Set and linear constraints for data trees. In the paper [David et al. 2012]
the set and linear constraints are introduced for data trees. As argued there, those
constraints, together with automata, are able to capture many interesting properties
commonly used in XML practice. We review those constraints and show how they
can be captured by weak ODTA extended with Presburger constraints.

Data-terms (or just terms) are given by the grammar

$\tau := V(a) \mid \tau \cup \tau \mid \tau \cap \tau \mid \overline{\tau}$ for $a \in \Sigma$.

The semantics of $\tau$ is defined with respect to a data tree t:

$[\![V(a)]\!]_t = V_t(a)$, $[\![\overline{\tau}]\!]_t = V_t - [\![\tau]\!]_t$,
$[\![\tau_1 \cap \tau_2]\!]_t = [\![\tau_1]\!]_t \cap [\![\tau_2]\!]_t$, $[\![\tau_1 \cup \tau_2]\!]_t = [\![\tau_1]\!]_t \cup [\![\tau_2]\!]_t$.

Recall that $V_t = \bigcup_{a \in \Sigma} V_t(a)$ is the set of data values found in the data tree t.

A set constraint is either $\tau = \emptyset$ or $\tau \neq \emptyset$, where $\tau$ is a term. A data tree t satisfies
$\tau = \emptyset$, written as $t \models \tau = \emptyset$, if and only if $[\![\tau]\!]_t = \emptyset$ (and likewise for $\tau \neq \emptyset$).


A linear constraint $\xi$ over the alphabet $\Sigma$ is a linear constraint on the variables
$x_a$, for each $a \in \Sigma$, and $z_S$, for each $S \subseteq \Sigma$. A data tree t satisfies $\xi$ if $\xi$ holds when
interpreting $x_a$ as the number of a-nodes in t, and $z_S$ as the cardinality $|[S]_t|$.

Theorem 6.5. Given a tree automaton $\mathcal{A}$ and a set C of set and linear constraints,
there exists a weak ODTA $(\mathcal{S},\varphi)$ extended with Presburger constraints
such that $\mathcal{L}_{data}(\mathcal{S},\varphi)$ is precisely the set of ordered-data trees accepted by $\mathcal{A}$ that
satisfy all the constraints in C.

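Proof. The proof is simply a restatement of the proof in [David et al. 2012] in
the language of weak ODTA. We need the following notation. For a data term $\tau$, we
define a family $\mathbb{S}(\tau)$ of subsets of $\Sigma$ as follows.

— If $\tau = V(a)$, then $\mathbb{S}(\tau) = \{S \mid a \in S \text{ and } S \subseteq \Sigma\}$.
— If $\tau = \overline{\tau_1}$, then $\mathbb{S}(\tau) = 2^{\Sigma} - \mathbb{S}(\tau_1)$.
— If $\tau = \tau_1 \star \tau_2$, then $\mathbb{S}(\tau) = \mathbb{S}(\tau_1) \star \mathbb{S}(\tau_2)$, where $\star$ is $\cap$ or $\cup$.

It follows that for every data tree t, we have $[\![\tau]\!]_t = \bigcup_{S \in \mathbb{S}(\tau)} [S]_t$. Recall that the
sets $[S]_t$ are disjoint.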
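The family of subsets attached to a term, and the identity stated just above, can be checked on small inputs with the following Python sketch. The term encoding is an assumption made for this example: `("V", a)` for $V(a)$, `("not", t)` for complement, and `("and", t1, t2)` / `("or", t1, t2)` for intersection and union. A tree is reduced to its list of `(label, value)` pairs, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def powerset(sigma):
    """All subsets of sigma as a set of frozensets."""
    pool = sorted(sigma)
    return {frozenset(c) for r in range(len(pool) + 1)
            for c in combinations(pool, r)}


def family(term, sigma):
    """Compute the family of subsets of sigma attached to a term."""
    kind = term[0]
    if kind == "V":
        return {S for S in powerset(sigma) if term[1] in S}
    if kind == "not":
        return powerset(sigma) - family(term[1], sigma)
    left, right = family(term[1], sigma), family(term[2], sigma)
    if kind == "and":
        return left & right
    if kind == "or":
        return left | right
    raise ValueError(f"unknown term constructor: {kind!r}")


def evaluate(term, nodes):
    """Evaluate a term directly on the (label, value) pairs of a tree."""
    kind = term[0]
    if kind == "V":
        return {d for a, d in nodes if a == term[1]}
    if kind == "not":
        return {d for _, d in nodes} - evaluate(term[1], nodes)
    left, right = evaluate(term[1], nodes), evaluate(term[2], nodes)
    return left & right if kind == "and" else left | right


def evaluate_via_family(term, nodes, sigma):
    """Evaluate a term as the union of the classes [S]_t with S in its family."""
    labels_of = defaultdict(set)
    for a, d in nodes:
        labels_of[d].add(a)
    fam = family(term, sigma)
    return {d for d, labels in labels_of.items() if frozenset(labels) in fam}
```

Both evaluation routes agree on every input, which is the content of the identity used in the proof.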

The desired $\mathcal{S} = (\mathcal{T},\mathcal{M},\Sigma_0)$ is defined as follows. The transducer $\mathcal{T}$ is the
identity transducer $\mathcal{A}$, and $\Sigma_0 = \emptyset$. The automaton $\mathcal{M}$ accepts a word $w \in (2^{\Sigma})^*$ if
and only if

— for every set constraint $\tau = \emptyset$, w does not contain any symbol from $\mathbb{S}(\tau)$;
— for every set constraint $\tau \neq \emptyset$, w contains at least one symbol from $\mathbb{S}(\tau)$.

The formula $\xi$ is the conjunction of all the linear constraints in C.

That $\mathcal{L}_{data}(\mathcal{S},\xi)$ is indeed precisely the set of ordered-data trees accepted by
$\mathcal{A}$ that satisfy all the constraints in C follows immediately from the definition of
$\mathbb{S}$. □

6.2.3 $FO^2(+1, \prec_{suc})$ over text. Here we focus our attention on ordered-data
words, which can be viewed as trees where each node has at most one child. We
write $w = \binom{a_1}{d_1} \cdots \binom{a_n}{d_n}$ to denote an ordered-data word in which position i has label
$a_i$ and data value $d_i$. It is called a text if all the data values are different and the
set of data values $\{d_1,\ldots,d_n\}$ is precisely $\{1,\ldots,n\}$.

It is shown in [Manuel 2010] that the satisfiability problem for $FO^2(+1, \prec_{suc})$
over text is decidable.** The following theorem shows that this decidability can be
obtained via weak ODTA.

Theorem 6.6. For every formula $\varphi \in FO^2(+1, \prec_{suc})$, one can construct effectively
a weak ODTA $\mathcal{S}$ such that

— for every text w, if $w \in \mathcal{L}_{data}(\varphi)$, then $w \in \mathcal{L}_{data}(\mathcal{S})$;
— for every ordered-data word $w \in \mathcal{L}_{data}(\mathcal{S})$, there exists a text $w' \in \mathcal{L}_{data}(\varphi)$ such
that Proj(w) = Proj(w').



** The definition of text in [Manuel 2010] is slightly different, but it is equivalent to our definition.
However, it turns out that the key lemma proved in [Manuel 2010] has a serious gap, which is
filled later on in [Figueira 2012]. The final result is still correct though.

Proof. In [Manuel 2010], the decidability is proved by constructing so-called
text automata, also defined in [Manuel 2010]. We review the precise definition here.
Let $w = \binom{a_1}{d_1} \cdots \binom{a_n}{d_n}$ be a text over the alphabet $\Sigma$. Therefore, $V(w) = S_1 \cdots S_n$ is
such that each $S_i$ is a singleton.

We define msp(w), the marked string projection of w, as the word $(a_1,b_1) \cdots (a_n,b_n)$,
where $b_i \in \{-1,1,*\}$ and

$$b_i = \begin{cases} -1 & \text{if } 1 \le i < n \text{ and } d_{i+1} + 1 = d_i \\ 1 & \text{if } 1 \le i < n \text{ and } d_i + 1 = d_{i+1} \\ * & \text{otherwise} \end{cases}$$
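The marked string projection is easy to compute. The Python sketch below follows the case distinction above under assumptions made for the example: an ordered-data word is a list of `(label, value)` pairs, the mark `*` is encoded as the string `"*"`, and the function names are hypothetical.

```python
def is_text(word):
    """Check that the data values of the word are exactly 1..n, each used once."""
    return sorted(value for _, value in word) == list(range(1, len(word) + 1))


def msp(word):
    """Return the marked string projection of an ordered-data word."""
    marked = []
    for i, (label, value) in enumerate(word):
        mark = "*"
        if i + 1 < len(word):               # the last position always gets "*"
            next_value = word[i + 1][1]
            if next_value + 1 == value:     # next value is the predecessor
                mark = -1
            elif value + 1 == next_value:   # next value is the successor
                mark = 1
        marked.append((label, mark))
    return marked
```

For example, the text with labels a, b, a, c and data values 2, 3, 1, 4 has marks 1, *, *, *.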

A text automaton over the alphabet $\Sigma$ is a pair $(\mathcal{T}_1,\mathcal{T}_2)$, where

— $\mathcal{T}_1$ is a non-deterministic letter-to-letter word transducer with the input alphabet
$\Sigma \times \{-1,1,*\}$ and the output alphabet $\Gamma$.
— $\mathcal{T}_2$ is a non-deterministic finite state automaton over $\Gamma$.

A text $w = \binom{a_1}{d_1} \cdots \binom{a_n}{d_n}$ is accepted by the text automaton $(\mathcal{T}_1,\mathcal{T}_2)$ if

— msp(w) is accepted by $\mathcal{T}_1$, yielding a string $\alpha_1 \cdots \alpha_n$;
— the string $\alpha_{i_1} \cdots \alpha_{i_n}$ is accepted by $\mathcal{T}_2$, where the indexes $i_1,\ldots,i_n$ are such that
$1 = d_{i_1} < d_{i_2} < \cdots < d_{i_n} = n$.

It is shown in [Manuel 2010] that for every $\varphi \in FO^2(+1, \prec_{suc})$, one can construct
effectively a text automaton $\mathcal{A}$ such that for every text w, $w \in \mathcal{L}_{data}(\varphi)$ if and only
if $w \in \mathcal{L}_{data}(\mathcal{A})$.

Now we are going to show how to get the desired ODTA $\mathcal{S} = (\mathcal{T},\mathcal{M},\Gamma)$. Let
$(\mathcal{T}_1,\mathcal{T}_2)$ be the text automaton as above. On input ordered-data word $w = \binom{a_1}{d_1} \cdots \binom{a_n}{d_n}$, $\mathcal{S}$
performs the following.

— The transducer $\mathcal{T}$ simulates $\mathcal{T}_1$ by guessing msp(w); it outputs the $\Gamma$-projection
and stores the $\{-1,1,*\}$-projection in its states.
— The automaton $\mathcal{M}$ is simply $\mathcal{T}_2$.

It is straightforward to see that such $\mathcal{S}$ is the desired weak ODTA. □
7. AN UNDECIDABLE EXTENSION 

In this section we would like to remark on an undecidable extension of weak ODTA.
Recall the language in Example 1. It has already been noted in the proof of
Proposition 5.2 that its complement is not accepted by any ODTA. Formally, the
complement of the language in Example 1 can be expressed with a formula of the form:

Va; Vy \J a{x) A \/ a{y) A E^*{x, y)^x^y, (8) 

where $\Sigma_0 \subseteq \Sigma$ and $E_{\downarrow}^*$ denotes the transitive closure of $E_{\downarrow}$. It can already be
deduced from [Bojanczyk et al. 2011, Proposition 29] that given an ODTA $\mathcal{S}$ and a
collection C of formulas of the form (8), it is undecidable to check whether there is
an ordered-data tree $t \in \mathcal{L}_{data}(\mathcal{S})$ such that $t \models \psi$ for all $\psi \in C$.

At this point we would also like to point out that extending ODTA with an operation
such as addition on data values immediately yields undecidability. This follows
from [Halpern 1991], where it is shown that, together with unary
predicates, addition yields undecidability.

8. WHEN THE DATA VALUES ARE STRINGS 

In this section we discuss data trees where the data values are strings from $\{0,1\}^*$,
instead of natural numbers. We call such trees string data trees. There are two
kinds of order for strings: the prefix order and the lexicographic order. Strings
with the lexicographic order form a linearly ordered domain, and thus ODTA can be
applied directly in that case.

For the prefix order, we have to modify the definition of ODTA. Consider a string
data tree t over the alphabet $\Sigma$. Let $V_t$ be the set of data values found in t. We
define $V_{\Sigma}(t)$ as a tree over the alphabet $2^{\Sigma}$, where

— $Dom(V_{\Sigma}(t))$ is $V_t \cup \{\epsilon\}$;
— for $u,v \in Dom(V_{\Sigma}(t))$, u is a parent of v if u is a proper prefix of v and there is no
$w \in Dom(V_{\Sigma}(t))$, other than u and v, such that u is a prefix of w and w is a prefix of v;
— for $u \in Dom(V_{\Sigma}(t))$ the label of u is S, if $u \in [S]_t$; and ROOT, if $u = \epsilon$.

We call $V_\Sigma(t)$ the tree representation of the data values in t. Consider the example
of a string data tree in Figure 2. We have

$$[\{a\}]_t = \{0101\} \qquad [\{b\}]_t = \{0100\}$$
$$[\{c\}]_t = \{01011\} \qquad [\{a,b\}]_t = \{01\}$$
$$[\{b,c\}]_t = \{010000\} \qquad [\{a,b,c\}]_t = \{010011\}.$$

So Dom($V_\Sigma(t)$) = {$\varepsilon$, 01, 0100, 0101, 010011, 010000, 01011}, and

— $\varepsilon$ (labelled ROOT) is the parent of 01;

— 01 is the parent of 0100 and 0101;

— 0100 is the parent of 010011 and 010000; and

— 0101 is the parent of 01011.
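
The parent relation above can be reproduced with a minimal Python sketch. It is only an illustration under our own assumptions: the function name `tree_representation` is hypothetical, the empty Python string `""` stands for the empty string that serves as the root, and labels are omitted. The parent of each data value is taken to be its longest proper prefix that also occurs in the domain, which is exactly the condition that no other element of the domain lies in between.

```python
from typing import Dict, Optional, Set


def tree_representation(values: Set[str]) -> Dict[str, Optional[str]]:
    """Map each node of Dom = values plus the empty string to its parent.

    The root (the empty string) is mapped to None.
    """
    dom = set(values) | {""}  # "" plays the role of the empty string (ROOT)
    parent: Dict[str, Optional[str]] = {"": None}
    for v in dom - {""}:
        # All proper prefixes of v that belong to Dom; "" always qualifies.
        candidates = [v[:i] for i in range(len(v)) if v[:i] in dom]
        # The longest such prefix has no element of Dom strictly in between.
        parent[v] = max(candidates, key=len)
    return parent


# Data values of the example in Figure 2, as listed in the text.
values = {"01", "0100", "0101", "010011", "010000", "01011"}
parent = tree_representation(values)
for node in sorted(values):
    print(node, "has parent", repr(parent[node]))
```

Since the prefixes of a fixed string are linearly ordered by length, the longest proper prefix in the domain is unique, so every node other than the root has exactly one parent and the result is indeed a tree.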

Now an ODTA for string data trees is $\mathcal{S} = (\mathcal{T}, \mathcal{A}, \Gamma_0)$, where $\mathcal{T}$ is a letter-to-letter
transducer to $\Gamma$ whose input alphabet pairs $\Sigma$ with the $\top$/$\bot$ symbols exactly as
in Section 5; $\mathcal{A}$ is an unranked tree automaton over the alphabet $2^\Gamma$; and
$\Gamma_0 \subseteq \Gamma$. The requirement for acceptance is the same as in Section 5,
except that $\mathcal{A}$ takes a tree over the alphabet $2^\Gamma$ as input. All the results in
Sections 5 and 6 carry over immediately to this model.

9. CONCLUDING REMARKS 

In this paper we study data trees in which the data values come from a linearly
ordered domain, where in addition to the equality test, we can test whether the data
value in one node is greater than that in another. We introduce ordered-data tree
automata (ODTA), provide their logical characterisation, and prove that their emptiness
problem is decidable. We also show that the logic $\exists\mathrm{MSO}^2(E_\downarrow, E_\rightarrow, \sim)$ can be captured
by ODTA.

Then we define weak ODTA, which are essentially ODTA without the ability to
perform equality tests on the data values of two adjacent nodes. We provide their logical
characterisation. We show that a number of existing formalisms and models studied
in the literature can already be captured by weak ODTA. We also show that
the definition of ODTA can be easily modified to the case where the data values
come from a partially ordered domain, such as strings.

Fig. 2. An example of a string data tree (on the left) and the tree representation
of its data values (on the right).

We believe that the notion of ODTA provides new techniques to reason about
ordered-data values on unranked trees, and thus may find applications
in practice. We have also shown that ODTA capture various formalisms on data trees
studied so far in the literature. As far as we know, this is the first formalism for
data trees with ordered data values that has neat logical and automata characterisations.